Abstract

In this thesis we explore means by which hypercubes can compute despite faulty processors
and links. We also study techniques which enable hypercubes to simulate dynamically
changing networks and data structures.

In chapter two, we investigate strategies for routing permutations on faulty hypercubes.
We assume that each node or edge in the hypercube fails with fixed probability p < 1 - √(1/2)
and that failures are independent of one another. We describe a constant c > 0 and a routing
algorithm which successfully routes messages between working processors in O(log N) steps
on an N-node faulty hypercube, with probability 1 - N^{-c}. We also strengthen an algorithm
due to Rabin which uses a redundant encoding of each message into log N pieces which are
routed along node-disjoint paths. A destination can reconstruct the original message as long
as at least (log N)/2 pieces arrive intact. We show that all messages are reconstructable at
their destinations with high probability, given that each node or edge fails with probability
O(1/log N) and that each message has Ω(log^2 N) bits. This guarantee obtains even if the
components fail during the course of the algorithm.

In chapter three, we develop techniques for reconfiguring hypercubes in the presence of
faults. Again assuming constant probabilities of failure and the independence of faults, we
show that a faulty hypercube can simulate a fault-free hypercube of the same size with only
constant delay. We exhibit both deterministic and randomized algorithms for hypercube
reconfiguration. We show that there exists a constant c' > 0 such that with probability
1 - N^{-c'} the deterministic algorithm finds a one-to-one embedding with dilation 3 and
O(log N) congestion. We also show that there exists a constant c'' > 0 such that with
probability 1 - N^{-c''} the randomized algorithm finds an embedding with constant load and
congestion with dilation 5.

In chapter four, we turn our attention to the embedding of dynamically growing data
structures in the hypercube. Specifically, we show that an arbitrarily growing binary tree
with a maximum of M nodes can be embedded in an N-node hypercube with load O(M/N + 1),
congestion O(M/N + 1) and dilation 12, with high probability. We also show how to embed a
dynamic M-node binary tree in an N-node butterfly with O(M/N + log N) load and dilation
2, with high probability.
Chapter 1


Introduction 


1.1  Hypercubes 

The hypercube has emerged as one of the most effective and popular network architectures
for large scale parallel computers. The Connection Machine, manufactured
and sold by Thinking Machines Corp., is a hypercube-based machine containing 2^{16}
processing elements. Machines based on hypercube architectures have been built by
Intel, Ncube, Caltech and others. It has been predicted that in the not-too-distant
future, hypercube-based machines containing up to a million processors will be available.
Thus, current conditions point to the utility of more advanced methods for
hypercube computation.

The n-dimensional hypercube H_n is a graph with N = 2^n nodes and Nn/2 edges.
The nodes of H_n are labeled with n-bit binary strings, and two nodes are linked by
an edge if the associated strings differ in precisely one bit. If the differing bit is in the
i-th position (1 ≤ i ≤ n) then the associated edge is called a dimension i edge. The
neighbor of a node v across the i-th dimension will be denoted by v^i. Similarly v^{i_1, i_2, ..., i_k}
will denote the node reached from v by traversing dimensions i_1, i_2, ..., i_k (that is,
by flipping those bits). We will use n and log N interchangeably. Pictures of labeled
two and three dimensional hypercubes and an unlabeled four dimensional hypercube
appear in figures 1-1 and 1-2.
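
As an illustration of this labeling, the following Python sketch represents each node of the n-dimensional hypercube as an integer whose binary expansion is its n-bit label. The function names, and the convention that dimension i corresponds to bit i - 1 counted from the least significant bit, are assumptions made for this example only; the text does not fix a bit order.

```python
def neighbor(v: int, i: int) -> int:
    """Return the neighbor of node v across dimension i (1 <= i <= n)."""
    return v ^ (1 << (i - 1))


def traverse(v: int, dims) -> int:
    """Return the node reached from v by crossing the given dimensions in turn."""
    for i in dims:
        v = neighbor(v, i)
    return v


def edges(n: int):
    """Yield every edge of the n-cube once, as (u, w, dimension) with u < w."""
    for u in range(1 << n):
        for i in range(1, n + 1):
            w = neighbor(u, i)
            if u < w:
                yield (u, w, i)


n = 3
N = 1 << n
# The edge count agrees with the formula N * n / 2 stated in the text.
assert sum(1 for _ in edges(n)) == N * n // 2
```

Because crossing a dimension is a single bit flip, crossing the same dimension twice returns to the starting node, and the order in which a set of distinct dimensions is crossed does not affect the endpoint.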

In  hypercube-based  machines,  the  nodes  of  the  graph  are  replaced  by  processors 
and the edges are replaced by links between the processors. For example, in the
Connection  Machine  each  node  of  a  12-dimensional  hypercube  contains  a  group  of  16 
processors. 

The  effectiveness  of  the  hypercube  for  parallel  computation  arises  from  the  wealth 
of  special-purpose  algorithms  written  for  it,  its  support  of  algorithms  written  for 
shared-memory  machines  and  its  ability  to  simulate  a  host  of  other  networks.  Many 
algorithms which run quickly on the hypercube already exist. Further, the hypercube’s
recursive structure and high connectivity make it likely that fast hypercube
algorithms  will  continue  to  be  invented  in  other  contexts. 

Hypercubes  have  demonstrated  their  usefulness  as  general-purpose  computers  as 
well. Fast routing algorithms ([VB], [Ran], [P]) allow for low-overhead interprocessor
communication. These algorithms enable the hypercube to simulate a parallel random
access machine, or PRAM, with only logarithmic delay. Since any set of messages
is deliverable in O(log N) time, each set of memory accesses can be simulated in
O(log N) time as well, even if the PRAM’s processors and memory locations are
spread  arbitrarily  among  the  hypercube’s  processors. 

Hypercubes perform even more admirably when simulating special-purpose networks.
The hypercube can simulate meshes, multidimensional arrays, binary trees,
x-trees,  pyramid  graphs,  butterflies,  cube-connected  cycles  and  other  networks,  all 
with  constant  delay.  In  many  cases,  these  other  networks  are  actually  subgraphs  of 
the  cube.  In  these  instances,  the  hypercube  can  simulate  the  special-purpose  network 
with  no  delay  at  all. 

1.2  Robustness 

In this thesis we will describe three ways in which the hypercube is robust in a changing
computational environment. Specifically, we show how the hypercube can support
fault-tolerant routing, how the hypercube can be easily reconfigured in the presence
of faults and how the hypercube can handle dynamically changing load requirements.
In the first two cases, the network itself changes due to the accumulation of faulty
processors and links. We show how the network can absorb these faulty components
while exhibiting little or no degradation of performance. In the third case, the computation
we expect the network to perform changes in accordance with the data in an
unpredictable fashion. We show how the network can distribute the resulting computational
load as optimally as if it had been completely specified beforehand. In all
three cases, a probabilistic approach helps us to achieve our results. In some cases,
we prove that these results would be impossible if randomness were not available.

In  chapters  two  and  three,  we  explore  fault  tolerant  properties  of  the  hypercube. 
We  assume  that  each  node  or  edge  has  some  constant  probability  of  failure.  In  chapter 
two  we  exhibit  two  randomized  algorithms  for  routing  permutations  on  hypercubes 
in  the  presence  of  faulty  components.  Both  algorithms  are  based  on  Valiant  and 
Brebner’s ([VB]) original randomized algorithm for routing permutations on hypercubes.
In the first algorithm, we modify the fault-free algorithm so that messages
avoid faults. In the second algorithm, packets are broken into pieces containing redundant
information. Since only a constant fraction of the pieces need to get through
to  reconstruct  the  original  packet,  the  algorithm  can  tolerate  the  loss  of  many  pieces 
due  to  faults.  To  route  a  permutation,  neither  algorithm  takes  more  than  a  constant 
factor  more  time  than  is  required  to  route  without  faults. 

Chapter three is devoted to reconfiguration algorithms. The effect of these algorithms
is that the nonfaulty processors of a hypercube with faults simulate the
processors of a completely functioning hypercube. The link connecting two processors
in the completely functioning hypercube appears as a functioning path between
the  nodes  simulating  them  in  the  cube  with  faults.  In  chapter  three,  we  describe 
reconfiguration  algorithms  which  enable  a  hypercube  with  many  faults  to  compute  as 
efficiently  as  a  hypercube  of  the  same  size  without  faults. 

The  efficient  simulation  of  dynamically  evolving  computation  structures  is  the 
subject  of  chapter  four.  We  show  that  a  hypercube  can  simulate  an  arbitrarily  growing 
binary  tree  with  only  constant  overhead.  As  the  tree  evolves,  new  nodes  are  assigned 
to  hypercube  processors.  Neighbors  in  the  tree  are  simulated  by  hypercube  processors 
only  a  constant  distance  apart.  For  any  tree,  the  randomized  algorithm  assigns  only 
a constant number of tree nodes to each processor with a probability that can be
made  arbitrarily  close  to  1.  Thus  both  computation  and  communication  overhead 
are  minimized. 

In  sections  1.3  -  1.5,  we  give  an  overview  of  the  results  in  each  of  chapters  two, 
three  and  four. 


1.3 Fault-Tolerant Routing

Given  a  network  with  a  large  number  of  components,  we  must  assume  that  some  of 
these  components  will  fail.  These  faults  may  be  introduced  when  the  machine  is  first 
built,  or  might  accumulate  over  time.  We  would  like  the  machine  to  work  despite  the 
faults. 

Currently,  when  a  processor  or  connection  in  the  Connection  Machine  fails,  the 
board  containing  the  offending  component  is  removed  and  replaced  with  a  functional 
board. At some point in the future, if and when very large machines are in general
use, fault-tolerant algorithms may well provide a viable alternative to wholesale
replacement.  Such  algorithms  might  enable  the  machine  to  correct  itself,  with  no 
outside  intervention. 

Fault-tolerant  behavior  will  be  a  major  focus  of  our  work.  Routing  in  the  presence 
of  faults,  which  we  study  in  chapter  two,  requires  techniques  for  either  stepping  around 
faults  or  coping  with  messages  which  run  into  faults.  Attempts  have  been  made  on 
both  of  these  fronts.  We  consider  a  routing  algorithm  successful  if  every  packet  sent 
from  a  working  processor  to  another  working  processor  arrives  intact.  Of  course,  this 
view  presupposes  that  the  higher-level  algorithm  in  effect  is  also  tolerant  of  faulty 
processors.  For  example,  a  PRAM  algorithm  would  have  to  tolerate  some  pattern  of 
faults  among  the  PRAM’s  processors.  Such  algorithms  have  yet  to  be  designed. 

Throughout  chapter  two,  we  assume  that  there  is  some  fixed  probability  p  (either 
a  constant  or  a  function  of  the  number  N  of  nodes  in  the  network)  such  that  each 
component of the hypercube fails with probability p. Furthermore, we will assume
that the failure of any given component is independent of the status of other parts
of the network. In some cases, this independence assumption may be unreasonable.
Components which share a physical location such as a chip or a board might have a
greater chance of failing in tandem. In this situation, our results can scale to work
in a hierarchical fashion. We may regard any hypercube as a hypercube whose nodes
are themselves hypercubes (a cross product of hypercubes). Thus we may treat the
chips or boards as nodes in a more coarse-grained hypercube.

Many of our algorithms are randomized as well. These algorithms have access
to a source of randomness and we only guarantee that they achieve desired results
an overwhelmingly large fraction of the time. Specifically, we guarantee that each
algorithm succeeds with probability at least 1 - N^{-k}, i.e. that each fails with a
probability that is an inverse polynomial in N. If we can make the exponent k as
large  as  we  like  (perhaps  by  relaxing  constants  in  the  performance  we  desire),  then 
we  say  that  the  algorithm  succeeds  with  high  probability. 

In [VB], Valiant and Brebner define a set of paths from sources to destinations
which, with high probability, allow all packets to arrive at their destinations in
O(log N) steps. Two different variations on Valiant and Brebner’s ideas allow us
to route in the presence of faults. These variations use different assumptions about
the prevalence of faults, the capability of processors to detect faults in neighboring
components, and the minimum size of the packets that we can route. In the first case,
we assume that faults occur independently and with constant probability p, that each
processor can detect in one time unit whether or not an adjacent node or link has
failed, and that messages have length Ω(log N). Our idea is for packets to follow close
to the paths defined in [VB], but loosely enough that they can avoid faults as they
encounter them. We show that if each packet avoids faults by taking random steps
away from its Valiant-Brebner path, then with high probability each packet uses a
path with only O(log N) edges and encounters only O(log N) other packets on its
path. This shows that each packet arrives at its destination in O(log N) steps.

We  devote  the  second  half  of  chapter  two  to  our  improvements  of  an  idea  of  Rabin 
([R]). In this case, we assume that each edge of the hypercube fails with probability
p = O(1/log^2 N), that processors remain ignorant of changes in the topology of the
network, and that packets have size Ω(log^2 N). Under these assumptions, Rabin
showed that if each packet is split into log N pieces and the pieces are routed to the
packet’s original destination by node-disjoint paths, then a constant fraction of each
packet’s pieces will arrive intact at the destination. This assumes that each piece
makes no attempt to avoid faults. A piece arrives at the destination if and only if
no faults lie on its path. Coupled with a method for recovering a packet from a
constant fraction of its pieces, this strategy allows us to choose paths as if the faults
were not there. We describe a very simple way to choose the paths: we use log N
paths parallel to the Valiant-Brebner path. We are then able to simplify the proof,
to allow node failures as well, and to increase the allowable failure rates to include
probabilities as high as p = O(1/n). (Recently, Giladi has reported similar results
([G]).)
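
To see informally why a failure rate of order 1/n is a natural threshold for this strategy, consider the rough calculation sketched below in Python. It is not taken from the analysis in chapter two. It assumes that a piece's path contains about 4n + 1 components (roughly 2n edges and 2n + 1 nodes on a two-phase route), that p = c/n for a hypothetical constant c, and that the pieces of one packet survive independently of one another. The function names and the chosen values of n and c are illustrative.

```python
import math


def piece_survival(n: int, c: float, components: int) -> float:
    """Probability that no component on a piece's path is faulty, for p = c/n."""
    p = c / n
    return (1.0 - p) ** components


def prob_fewer_than_half(n: int, q: float) -> float:
    """P[fewer than n/2 of n pieces arrive], assuming independent survival with probability q."""
    need = math.ceil(n / 2)
    return sum(math.comb(n, j) * q**j * (1.0 - q) ** (n - j) for j in range(need))


n = 20                   # hypercube dimension, so N = 2^20
c = 0.1                  # hypothetical constant in p = c/n
components = 4 * n + 1   # assumed path size: about 2n edges and 2n + 1 nodes
q = piece_survival(n, c, components)
print(f"per-piece survival: {q:.3f}  (compare exp(-4c) = {math.exp(-4 * c):.3f})")
print(f"P[fewer than n/2 pieces arrive]: {prob_fewer_than_half(n, q):.4f}")
```

Under these assumptions the per-piece survival probability (1 - c/n)^{4n+1} tends to e^{-4c} as n grows, so it stays bounded away from zero. With a constant failure probability p it would instead decay exponentially in n, which is one way to see why nonadaptive paths alone cannot cope with constant failure rates.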


1.4  Reconfiguration 

Network  reconfiguration  involves  assigning  to  working  components  the  tasks  that 
the  failed  components  would  otherwise  perform.  The  goal  is  to  leave  the  network’s 
processing  power  undiminished  in  the  eyes  of  the  outside  world,  except  perhaps  for 
a  minor  slowdown  in  speed  to  allow  some  components  to  perform  multiple  duty. 
Alternatively,  we  can  view  reconfiguration  as  the  embedding  of  a  fault-free  network 
H'_n of the same size into the working parts of the faulty network H_n. We can show that
even  if  a  constant  fraction  of  the  hypercube’s  processors  and  links  fail,  what  remains 
keeps  the  original  cube’s  processing  power  with  only  a  constant  factor  degradation  in 
speed,  with  high  probability. 

We  make  the  same  probabilistic  fault  assumptions  in  chapter  three  that  we  made 
in  chapter  two.  Each  component  fails  with  constant  probability  and  independently 
of  other  components. 

Some of our techniques may be of use with other hypercube-related problems. In
particular, there is one simple observation that is used in two forms in section 3.5.
Although the observation has probably been made by others, it is basic enough that
we think it worth highlighting as a paradigm for distributed match-making.

We will describe the result in its most basic form. Consider a collection of O(N)
men and Θ(N) women at a dance. Assume that each man has at least Ω(X) female
friends and that each woman has at most O(X) male friends. By Hall’s marriage
theorem, it is possible to schedule O(1) rounds of dances so that every man dances
with at least one friend and every woman dances at most O(1) times. Unfortunately,
the problem of scheduling dance partners requires substantial global coordination.
For our purposes, we focus on a scenario where pairing is accomplished simply by
a man asking a woman to dance. If many men ask a woman to dance at once, she
accepts as many as she can, making sure not to exceed her capacity of C = O(1)
dances for the evening. If she can only accept some of the men, she prefers the
tallest among them. Each man chooses a friend randomly for each dance (without
knowledge of which women are tired or which women other men are asking) until he
dances. The result (which we call the Dance Hall Theorem, pun intended) is that if
X = Ω(log N), and there are Ω(log N) dances, then with high probability every man
will dance during the course of the evening. That is, for any lower bound bX on the
number of female friends each man has, any upper bound b'X on the number of male
friends each woman has and any constant k, there is a C such that for sufficiently
large N, with probability 1 - N^{-k} a capacity of C is sufficient.
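
The local match-making process just described can be simulated directly. The Python sketch below is only an illustration of the scenario, not the construction used in chapter three. The function name dance_hall, the representation of friendships as lists, and the parameter values in the example are assumptions. In particular, the random friendship lists only bound the number of male friends per woman on average, whereas the theorem assumes a worst-case bound.

```python
import random


def dance_hall(friends, heights, capacity, rounds, rng):
    """Simulate the local match-making process.

    friends[m] -- non-empty list of women that man m may ask
    heights[m] -- women prefer men with larger values
    capacity   -- maximum number of dances per woman (the constant C)
    rounds     -- number of dances in the evening
    Returns a dict mapping each man who danced to his partner.
    """
    remaining = {}   # woman -> number of dances she can still accept
    partner = {}     # man -> woman he danced with
    for _ in range(rounds):
        requests = {}   # woman -> men asking her in this round
        for m, fs in enumerate(friends):
            if m not in partner:
                # Each waiting man asks a uniformly random friend,
                # with no knowledge of her remaining capacity.
                requests.setdefault(rng.choice(fs), []).append(m)
        for w, askers in requests.items():
            free = remaining.get(w, capacity)
            askers.sort(key=lambda m: heights[m], reverse=True)
            accepted = askers[:free]   # tallest first, up to her capacity
            for m in accepted:
                partner[m] = w
            remaining[w] = free - len(accepted)
    return partner


rng = random.Random(0)
N, X = 256, 16
friends = [rng.sample(range(N), X) for _ in range(N)]
heights = list(range(N))
partner = dance_hall(friends, heights, capacity=3, rounds=8, rng=rng)
print(len(partner), "of", N, "men danced")
```

Each man acts only on his own list of friends and each woman only on the requests she receives, so no global matchmaker is involved. Varying the capacity and the number of rounds in the example gives a feel for how the two parameters trade off.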

The Dance Hall Theorem scenario first arises in our analysis when we attempt to
embed the nodes of the fault-free hypercube H'_n in the functioning nodes of the faulty
hypercube H_n. The nodes of H'_n correspond to men and the functioning nodes of H_n
correspond to women. If a man dances with a woman, then the corresponding node of H'_n
will be simulated by the corresponding node of H_n. We need the Dance Hall Theorem to
ensure that the load of the embedding
is O(1) (i.e. every woman dances with O(1) men) and to ensure that the
embedding  can  be  constructed  quickly  with  local  control  (no  global  matchmaker). 
We  also  need  some  other  as-yet-undescribed  properties  of  the  Dance  Hall  Theorem 
schedule  to  ensure  that  the  hypercube’s  edges  are  not  overtaxed  by  the  embedding, 
but  these  are  more  technical  in  nature  and  will  be  dealt  with  in  the  main  text. 
1.5 Dynamic Load Balancing


The  desire  for  the  optimal  use  of  computational  resources  is  often  modelled  as  an 
embedding problem. We construct a graph whose nodes represent the data and processes.
An edge connects two processes which trade information. To minimize computation
time, we would like to divide the processing requirements evenly among the
processors  of  our  network.  To  minimize  communication  time,  we  would  like  to  assign 
neighboring  processes  to  processors  which  are  fairly  close.  These  two  requirements 
may  conflict. 
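
The tension between these two requirements can be made concrete with the standard measures of load (the largest number of processes assigned to one processor) and dilation (the largest distance between processors holding neighboring processes). The Python sketch below computes both for a placement into a hypercube, where distance is the Hamming distance between node labels. The function names and the two example placements of a 7-node complete binary tree are assumptions chosen for illustration.

```python
from collections import Counter


def hamming(u: int, v: int) -> int:
    """Hypercube distance between nodes u and v: the number of differing label bits."""
    return bin(u ^ v).count("1")


def load_and_dilation(process_edges, placement):
    """Return (load, dilation) of a placement of processes onto hypercube nodes.

    process_edges -- pairs (a, b) of processes that trade information
    placement     -- dict mapping each process to a hypercube node label
    """
    load = max(Counter(placement.values()).values())
    dilation = max(
        (hamming(placement[a], placement[b]) for a, b in process_edges),
        default=0,
    )
    return load, dilation


# A 7-node complete binary tree, with nodes numbered in heap order.
tree_edges = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (3, 7)]

spread = {k: k for k in range(1, 8)}    # one process per node of the 3-cube
stacked = {k: 0 for k in range(1, 8)}   # every process on node 000

print(load_and_dilation(tree_edges, spread))    # (1, 3)
print(load_and_dilation(tree_edges, stacked))   # (7, 0)
```

The first placement balances computation perfectly but separates tree nodes 2 and 5 by three links; the second eliminates communication distance entirely at the cost of putting all the work on one processor.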

For  one  solution,  we  might  build  a  network  which  perfectly  mirrors  the  processes 
involved  and  embed  each  process  in  its  own  processor.  There  are  two  problems  with 
this  approach.  First,  every  algorithm  would  require  a  different  network  structure 
depending  upon  how  it  divided  up  the  work.  Worse,  the  same  algorithm  might 
generate  a  different  process  graph  for  different  input  data.  In  this  case  no  foresight 
could  help  in  network  construction.  One  (far  from  unique)  example  can  be  found  in 
the  context  of  branch-and-bound  algorithms.  The  search  tree  developed  during  each 
run  of  a  branch-and-bound  algorithm  changes  based  on  which  subtrees  are  cut  and 
which  are  chosen  for  further  exploration.  We  could  not  hope  to  build  a  processor  tree 
which  could  handle  all  potentialities  unless  it  were  far  larger  than  any  one  tree  that 
might  be  generated  during  any  particular  run. 

As  a  second  solution,  we  might  build  a  network  into  which  all  similarly  sized  trees 
can be embedded. A practical network would allow us to embed a tree dynamically.
As  we  embed  the  tree,  we  have  no  knowledge  of  which  branches  will  develop  many 
nodes  in  the  future,  and  which  will  cease  to  exist  at  all.  We  must  allow  sufficient 
room  for  all  possibilities. 

In chapter four, we demonstrate a randomized algorithm which, with high probability,
embeds an arbitrary dynamic binary tree in a hypercube so that the computation
and communication overhead are both constant. A simplified version of the
algorithm embeds a dynamically growing tree in a butterfly smaller by a logarithmic
factor. Both computation and communication are slowed by only a logarithmic factor,
the best possible. Thus hypercubes and butterflies can run tree-based algorithms
as quickly as can PRAMs.
Chapter 2


Routing  in  the  Presence  of  Faults 


2.1  Introduction 

To  successfully  simulate  shared  memory,  a  parallel  network  must  have  the  ability  to 
route  information  between  different  origin  processors  and  destination  processors  at 
the  same  time.  Since  processors  trade  information  throughout  the  course  of  parallel 
computations,  the  overhead  due  to  the  transmission  of  information  over  the  network 
shows  up  as  a  multiplicative  factor  in  the  time  to  perform  many  tasks.  Thus  the 
routing  question  is  one  of  fundamental  importance. 

In  practice  and  theory,  the  store-and-forward  model  of  communication  is  often 
used.  In  this  model,  once  a  node  begins  transmission  of  a  message  unit  across  a 
link,  it  continues  to  transmit  until  the  entire  message  is  sent.  Treating  messages  as 
inviolable  packets  allows  us  to  ignore  some  significant  issues  of  control  at  the  cost  of 
time.  Since  time  bounds  for  packet-switched  networks  are  often  stated  in  units  of 
packet  steps,  such  bounds  must  be  multiplied  by  the  length  of  the  longest  message  to 
produce  a  bound  in  bit  steps. 

Many  algorithms  have  appeared  for  routing  on  hypercubes  and  networks  derived 
from hypercubes (such as the butterfly). In 1981, Valiant and Brebner ([VB]) presented
an algorithm for routing Ω(log N)-bit packets on the log N × N-node butterfly
(and hence the N-node cube) which could route permutations from the top level to
the bottom in O(log N) packet steps, with high probability.* Here a permutation
means  that  the  mapping  from  origins  to  destinations  is  bijective.  Their  algorithm, 
which  we  will  review  in  section  2.2,  introduced  the  paradigm  of  routing  each  packet 
first  to  a  random  intermediate  destination  and  then  to  its  true  final  destination.  The 
algorithm  routes  obliviously:  each  packet’s  path  is  chosen  without  regard  for  the 
paths  of  any  other  packets. 
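
The two-phase paradigm can be sketched on the hypercube as follows. This Python fragment is an illustration under stated assumptions rather than a reproduction of the algorithm in [VB]: it assumes that within each phase a packet corrects differing bits in increasing order of dimension, and the function names are hypothetical. Queueing and contention between packets are not modeled.

```python
import random


def bit_fixing_path(src: int, dst: int, n: int):
    """Nodes visited from src to dst, correcting differing bits in dimension order."""
    path = [src]
    cur = src
    for i in range(n):
        mask = 1 << i
        if (cur ^ dst) & mask:
            cur ^= mask
            path.append(cur)
    return path


def two_phase_path(src: int, dst: int, n: int, rng):
    """Route to a uniformly random intermediate node, then on to the destination."""
    mid = rng.randrange(1 << n)
    first = bit_fixing_path(src, mid, n)
    second = bit_fixing_path(mid, dst, n)
    return first + second[1:]   # drop the repeated intermediate node


rng = random.Random(1)
path = two_phase_path(0b0000, 0b1111, 4, rng)
print([format(v, "04b") for v in path])
```

The path is oblivious in the sense used above: it depends only on the packet's own source, destination and random choice, never on the routes of other packets. Each phase crosses at most n edges, so the whole path has at most 2n edges.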

This  simple  addition  of  randomness  is  enough  to  overcome  the  proven  delays 
involved  with  deterministic  routing  algorithms.  Borodin  and  Hopcroft  ([BH])  showed 
that any deterministic oblivious algorithm must necessarily take Ω(√N/(log N)^{3/2})
bit steps, in the worst case, for any N-node network.

Since  Valiant  and  Brebner’s  pioneering  work,  significant  improvements  have  been 
made. Pippenger ([P]) showed how to route permutations of a fully loaded log N × N
butterfly in O(log N) steps with high probability. That is, each node in the butterfly
can generate a packet, not only the nodes in the top level. In Pippenger’s algorithm,
only a constant number of packets reside in a queue at any time. Ranade ([Ran])
produced an algorithm which routes arbitrary mappings on a fully-loaded butterfly
using combining, again with constant size queues and in O(log N) packet steps with
high  probability.  Both  of  these  algorithms  make  fundamental  use  of  the  paradigm  of 
routing  to  random  intermediate  destinations. 

2.1.1  Summary  of  Results 

In  this  chapter,  we  consider  the  problem  of  packet  routing  on  a  hypercube  with 
faults.  We  assume  that  every  node  and  link  of  a  hypercube  fails  independently  with 
constant  probability  p.  Under  this  assumption,  with  probability  exponentially  close 
to  1,  a  constant  fraction  of  the  components  of  the  cube  will  fail.  In  the  presence 
of  such  a  large  number  of  faults,  we  would  like  to  route  packets  so  that  any  packet 
generated  by  a  working  node  and  sent  to  a  working  node  arrives  safely  within  the 
stated  time  bound. 

* We use the phrase “Q is less than O(g) with high probability” to mean “For every k there exists
a constant d independent of N such that the probability that Q exceeds dg is less than N^{-k}.”
We describe and analyze a randomized packet routing algorithm that adaptively
routes packets around faults as they are encountered in an N-node hypercube that
contains Θ(N) randomly located faulty nodes and Θ(N log N) randomly located
faulty edges. We assume that each processor can decide if an adjacent node or link
has failed. Also, each processor can choose a random element from a set with as
many as log N elements according to the uniform distribution. We define the property
of local routability, a characterization of the connectivity of the network after
some components have failed. There exists a constant c_1 such that the hypercube
remains locally routable with probability 1 - N^{-c_1}. We prove that, given that the
hypercube is locally routable, the algorithm routes any permutation on the working
processors  in  O(log  N)  steps  with  high  probability.  That  is,  under  the  assumption  of 
local  routability,  we  reproduce  Valiant  and  Brebner’s  results  in  the  presence  of  faulty 
components.  Packets  which  start  or  end  at  faulty  nodes  are  eventually  determined 
to  be  undeliverable.  All  the  deliverable  packets  arrive  at  their  destinations  provided 
that  they  are  not  located  in  the  immediate  vicinity  of  a  processor  at  the  moment 
it fails. The algorithm is fault-tolerant in the sense that no advance knowledge of
the  locations  of  the  faults  is  needed  for  the  path  selection,  but  it  is  susceptible  to 
nodes  which  fail  while  holding  packets.  The  algorithm  is  of  interest  because  during 
most steps, few processors will fail and almost all deliverable packets will be delivered.
In addition, the algorithm itself is quite simple and is the first adaptive routing
algorithm for which an O(log N) bound on the routing time has been achieved.

Work on adaptive routing for faulty hypercubes is potentially applicable outside
the setting of fault-tolerance. Except for the algorithm we present, all known
O(log N) packet step routing algorithms for the hypercube are inherently nonadaptive.
Whereas Ω(log N) packet steps are also a lower bound on the time to route
(since the diameter of the hypercube is log N), the implied O(log^2 N) bit step bound
for O(log N)-size packets is not provably optimal. Recently, we have proven a lower
bound of Ω(log^2 N / log log N) bit steps for all nonadaptive algorithms ([ALN]). Thus,
serious improvement on the upper bound will have to come from an adaptive algorithm.
There has been other work on packet routing on faulty hypercubes. Most notably,
Rabin ([R]) has devised an elegant scheme called information dispersal routing
wherein each packet to be routed is decomposed into log N pieces. The pieces are
routed in a randomized nonadaptive fashion to their destinations and then recombined
to form the original message. A key aspect of the scheme is that the packet
decomposition  uses  error-correcting  codes.  Therefore  only  a  constant  fraction  of  the 
pieces  of  any  packet  need  to  get  through  to  the  destination  for  the  packet  to  be 
reconstructed. 

Rabin  makes  different  assumptions  about  both  the  nature  of  fault  detection  and 
the  size  of  the  packets.  His  model  assumes  no  detection  of  nearby  faults  is  possible. 
In  his  algorithm,  each  node  chooses  log  N  node-disjoint  paths  on  which  to  send  its 
pieces  without  regard  for  faults  they  may  contain.  If  a  packet  encounters  a  fault,  it 
is  lost.  Rabin’s  scheme  is  useful  only  if  the  original  packets  represent  relatively  long 
bit streams. Because routing information alone uses Θ(log N) bits, each of the log N
pieces into which a packet is divided must contain Ω(log N) bits. Thus the original
packets must have length Ω(log^2 N). Additionally, Rabin’s analysis requires the
failure rate p to be O(1/log^2 N) and allows only edge faults. At most Θ(N/log N)
edge faults can be absorbed. Under these conditions, the Rabin algorithm provides a
fully fault-tolerant routing of N packets in O(log N) steps with high probability.

In section 2.3, we show how to achieve Rabin’s results with a simpler algorithm
and analysis. Our analysis permits both node and edge faults and requires $p$ to
be $O(1/\log N)$, so that the routing can absorb up to $\Theta(N)$ edge faults as well as
$\Theta(N/\log N)$ node faults. (A similar result based on Rabin’s original algorithm has
recently been discovered by Giladi ([G]).) We also briefly sketch a way to potentially
improve its tolerance to faults in as many as a constant fraction of components by
combining the decomposition scheme with our adaptive algorithm for routing around
faults.

All  of  chapter  two  represents  joint  work  with  Tom  Leighton.  In  addition,  lemma 
2.7  is  the  result  of  work  with  Bill  Aiello  and  Satish  Rao. 
2.1.2  Overview


Section 2.2 contains the $O(\log N)$-time adaptive routing algorithm. In section 2.3, we
show how to improve Rabin’s fault-tolerant results with a simpler algorithm.


2.2  Fast  Routing  Around  Faults 

In this section we examine the problem of routing a permutation on a faulty
hypercube. We describe a variant of Valiant-Brebner routing on the hypercube that we call
offset routing. The success of the algorithm depends on local routability, a condition
on the connectivity of the nonfaulty processors. We show that with probability close to 1
a faulty hypercube remains locally routable and that, if it does, the routing algorithm
works with high probability.

We make several assumptions about the nature of faults and about the abilities
of the network’s processors. Every node and edge of the hypercube is assumed to fail
independently of other components and with a constant probability $p < 1 - \sqrt[6]{1/2}$.
Every node is able to detect whether a neighboring node or the link to it is faulty by
simply sending a one-bit message and waiting for a response. It does not matter if
the node cannot detect whether the fault lies in the neighbor or the link. We make
only minimal assumptions about the messages themselves. Since routing information
uses $\Theta(\log N)$ bits and must accompany each message, we assume that each packet
contains $\Omega(\log N)$ bits.

The idea of the offset routing algorithm is to route around the faulty components.
Say a hypercube node $v$ holds a message from some source and that the route to
the destination dictates that the message be sent to its neighbor $v^k$ across the $k$th
dimension. Further assume that the edge $(v, v^k)$ has failed. One way to pass the
message on would be to find a dimension $i \ne k$ for which all components in the path
$(v, v^i, v^{ik}, v^k)$ are nonfaulty. A picture of this path appears in figure 2-1.

Figure 2-1: A path of length three avoiding a faulty edge.

Unfortunately, if some node on the path from source to destination has failed
and paths like that shown in figure 2-1 are used exclusively, the message will not
get through. To allow for the existence of node faults, we make sure that once we
have decided on a path from the source to the destination, the message never resides
in any of the processors along the path until it reaches its destination. The path is
treated as a virtual path. Instead of residing in some node $v$ along the virtual path,
the packet will reside in some neighboring node $v^i$. That is, it will be offset by the
dimension $i$. If dimension $k$ is to be traversed, some other offset $j$ will be chosen for
which the entire path $(v^i, v^{ij}, v^{ijk}, v^{jk})$ is fault-free. Thus, instead of residing in node
$v^k$, the packet will reside in $v^{jk}$; that is, it will be offset by dimension $j$. In this
fashion, the offset path skirts around the virtual path but never meets it until the
packet reaches its destination.

The offset routing algorithm uses randomness in two different ways. First,
randomness is used to select virtual paths from sources to destinations. The virtual
paths we will use are precisely the paths chosen by the Valiant-Brebner algorithm.
Second, the offsets used along the way will be chosen from among those which create
a live path of length three to the next offset node.

In section 2.2.1, we define butterflies and we review the Valiant-Brebner routing
algorithm. We prove some important bounds on the number of messages the algorithm
is likely to route through small sets of edges. In section 2.2.2, we define another
network, the butterfly with jump edges, which helps us to think about the offset
routing algorithm on the hypercube. In section 2.2.3, we describe the offset routing
algorithm explicitly. Section 2.2.4 proves a limit of $O(\log N)$ on the length of any
offset path. Finally, section 2.2.5 shows that only $O(\log N)$ other messages use any of
the edges of a particular message’s path. This proves that the offset routing algorithm
finishes in $O(\log N)$ routing steps.

2.2.1  Valiant-Brebner  Routing 

The virtual paths we will use are those dictated by the Valiant-Brebner routing
algorithm. Since that algorithm is viewed more intuitively as a butterfly algorithm,
we will present it that way. First, we review some basic butterfly concepts. Next we
present the Valiant-Brebner routing algorithm. Last, we prove two lemmas about how
uniformly the algorithm uses edges. These lemmas will be useful when we examine
the usage of edges by the offset routing algorithm.

The $\log N \times N$-node or $\log N$-dimensional butterfly is obtained from the $N$-node
hypercube by replacing each node $v$ of the cube by a cycle $(v_0, v_1, \ldots, v_{n-1}, v_0)$,
where $n = \log N$. We replace each edge $(v, v^i)$ by a pair of edges $(v_{i-1}, v^i_i)$ and
$(v^i_{i-1}, v_i)$ (indices mod $n$). We can visualize the set of nodes $\{v_i\}$, for fixed $i$ and
$v$ ranging over the hypercube nodes, as sharing a level of the butterfly. We call edges
of the form $(v_{i-1}, v_i)$ straight edges and those of the form $(v_{i-1}, v^i_i)$ cross edges. All
edges connect nodes in adjacent levels (mod $n$).


Figure  2-2:  A  three  level  butterfly.  (The  top  and  bottom  rows  are  identified.) 
All dimension $i$ hypercube edges connect the $(i-1)$st level with the $i$th level. Thus
any hypercube algorithm which only uses one dimension during each step and only
uses consecutive dimensions during consecutive steps can run on the butterfly just
as quickly. Any butterfly algorithm works as well on the hypercube from which the
butterfly was obtained. We may regain this hypercube by collapsing columns of the
butterfly.

The Valiant-Brebner hypercube routing algorithm is also a butterfly routing
algorithm. A packet starts at some node $v_0$ and ends at some node $w_0$. We think of the
column of nodes $\{v_i\}$ as being shared by the hypercube node $v$, which assigns each
node in the column a different queue from a set of $n$ queues. If a message traverses
the straight edge $(v_{i-1}, v_i)$ in some butterfly step, then it is passed from the node $v$’s
$(i-1)$st queue to its $i$th queue in the hypercube step. If the message traverses the
cross edge $(v_{i-1}, v^i_i)$ in some butterfly step, then it is passed from $v$’s $(i-1)$st queue
to $v^i$’s $i$th queue in the hypercube step.

Routing from $v_0$ to $w_0$ is simplified by the fact that there is a unique path of
length $n$ between those two nodes. The $i$th step in the path connects a node at level
$i-1$ with one at level $i$. If $v$ and $w$ agree in the $i$th bit, the edge is a straight edge.
If they differ, a cross edge is used. For example, to route from the node $(1,1,0)_0$ to
the node $(0,1,1)_0$ we would use the path $(1,1,0)_0$, $(0,1,0)_1$, $(0,1,0)_2$, $(0,1,1)_0$.
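The bit-fixing rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `butterfly_path` and `edge_types`, and the representation of a butterfly node as a pair (bit tuple, level), are hypothetical choices made here and do not come from the thesis.

```python
def butterfly_path(v, w):
    """Return the unique length-n butterfly path from v at level 0 to w at level 0.

    v and w are tuples of n bits; position i-1 of a tuple holds bit i.
    Each path entry is (bits, level), with level n identified with level 0.
    """
    n = len(v)
    assert len(w) == n
    current = list(v)
    path = [(tuple(current), 0)]
    for i in range(1, n + 1):
        # Step i joins level i-1 to level i. Copying bit i of the destination
        # corresponds to a cross edge if the bit changes, a straight edge if not.
        current[i - 1] = w[i - 1]
        path.append((tuple(current), i % n))
    return path


def edge_types(v, w):
    """Label each of the n steps as a 'cross' or 'straight' edge."""
    return ["cross" if a != b else "straight" for a, b in zip(v, w)]


if __name__ == "__main__":
    # The example from the text: route (1,1,0)_0 to (0,1,1)_0.
    for bits, level in butterfly_path((1, 1, 0), (0, 1, 1)):
        print(bits, level)
```

On the example from the text, the sketch visits the same four nodes and uses a cross edge, a straight edge, and a cross edge in that order.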

In the first phase of the Valiant-Brebner routing algorithm, each node in level 0
first sends its packet to a random node in the same level using the unique path of
length $n$. In the second phase, the packet is routed along the unique path to its true
destination. In [VB] it was shown that this algorithm takes $O(n)$ steps to complete
and uses total queue length $O(n)$ at every hypercube node, with high probability.
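A minimal sketch of the two-phase path selection follows, viewed on the hypercube obtained by collapsing the butterfly columns. It only chooses a path; it does not model queues, timing, or contention. The names `fix_bits` and `valiant_brebner_route` are hypothetical, and the random intermediate node is drawn with Python's standard `random` module.

```python
import random


def fix_bits(start, target):
    """Hypercube nodes visited while setting bits 1..n of start to those of target."""
    current = list(start)
    nodes = [tuple(current)]
    for i in range(len(start)):
        current[i] = target[i]
        nodes.append(tuple(current))
    return nodes


def valiant_brebner_route(src, dst, rng):
    """Phase 1: src to a uniformly random intermediate node. Phase 2: on to dst."""
    n = len(src)
    mid = tuple(rng.randrange(2) for _ in range(n))
    phase1 = fix_bits(src, mid)
    phase2 = fix_bits(mid, dst)
    # Drop the duplicated intermediate node when joining the two phases.
    return mid, phase1 + phase2[1:]


if __name__ == "__main__":
    mid, route = valiant_brebner_route((1, 1, 0), (0, 1, 1), random.Random(0))
    print("intermediate:", mid)
    print("route:", route)
```

Each route has exactly $2n$ steps; a step in which the bit already agrees corresponds to a straight edge and leaves the hypercube node unchanged.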

We  will  worry  about  congestion,  or  the  total  number  of  messages  using  a  given 
set  of  edges,  in  the  offset  routing  schedule.  A  message  can  congest  an  edge  only  if 
its  virtual  path  brings  it  close  to  that  edge.  It  will  then  congest  the  edge  only  if 
particular  choices  of  offset  are  made.  To  bound  the  congestion,  we  will  first  bound 
the  number  of  messages  whose  virtual  paths  come  close  to  a  given  set  of  edges.  We 
need only the following two bounds on the number of messages traversing small sets
of edges via their Valiant-Brebner paths.

Lemma 2.1. Take an arbitrary set of $h$ edges on one level of the $n$-dimensional
butterfly. Then with high probability the Valiant-Brebner routing scheme routes
only $O(h + n)$ messages through edges in the set.

Proof.  Note  that  each  message  can  congest  at  most  one  edge  in  the  set.  The  following 
analysis  applies  to  the  first  phase  of  the  routing  algorithm.  The  analysis  for  the  second 
phase  is  almost  identical. 

Say the edges share level $l$ of the butterfly. Then we can partition the butterfly’s
first $l$ levels into $N/2^l$ nonintersecting butterflies $B_1, B_2, \ldots, B_{N/2^l}$, each built from a
subcube with $2^l$ nodes. For a message to route through one of the $h$ edges, it must
start in the same butterfly as the edge. Say that $h_i$ of the edges lie in butterfly $B_i$.
Because paths are chosen uniformly, each message is equally likely to traverse any of
the edges in a level of $B_i$. Thus each message starting in butterfly $B_i$ has probability
$p_i = h_i/2^l$ that it will hit one of the edges in the set.

For a node $v$, let $X_v = 1$ if $v$’s packet congests an edge in the set and 0 otherwise.
We wish to bound the value of $X = \sum_v X_v$. To do so, we bound the moment
generating function $M(\lambda) = E[e^{\lambda X}]$ for positive $\lambda$. We can then bound
$\Pr[X \ge kh] = \Pr[e^{\lambda X} \ge e^{\lambda kh}] \le M(\lambda) e^{-\lambda kh}$. This bound follows directly from
Markov’s inequality $\Pr[Y \ge b] \le E[Y]/b$ for any nonnegative random variable $Y$ and
positive bound $b$. We will first bound the moment generating functions
$M_v(\lambda) = E[e^{\lambda X_v}]$. We can then use the fact that, since the $X_v$ are independent,
$M(\lambda) = \prod_v M_v(\lambda)$.

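The moment generating function $M_v(\lambda)$ will depend on the butterfly $B_i$ to which
$v$ belongs. If $v \in B_i$ then $M_v(\lambda) = E[e^{\lambda X_v}] = p_i e^{\lambda} + 1 - p_i$. Precisely $2^l$ nodes in
the butterfly share this moment generating function. Thus the moment generating
function $M(\lambda)$ for $X$ satisfies

\begin{align*}
M(\lambda) &= \prod_i \left(p_i e^{\lambda} + 1 - p_i\right)^{2^l} \\
           &= \prod_i \left(1 + p_i (e^{\lambda} - 1)\right)^{2^l} \\
           &\le \prod_i \exp\left(2^l p_i (e^{\lambda} - 1)\right) \\
           &= e^{(e^{\lambda} - 1) h}.
\end{align*}

The inequality between lines two and three follows from the inequality $1 + x \le e^x$
for all $x$. The last line uses $2^l p_i = h_i$ and $\sum_i h_i = h$.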

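Thus $\Pr[X \ge kh] \le e^{(e^{\lambda} - 1)h} e^{-\lambda kh} = (e^{e^{\lambda} - 1 - \lambda k})^h$. Setting $\lambda = \ln k$, this implies
$\Pr[X \ge kh] \le (e^{k(1 - \ln k) - 1})^h$, a bound which can be made as small as desired by
increasing the constant $k$.

If $h \ge n$, then the probability that more than $kh$ messages pass through the
edges is less than $(e^{k(1 - \ln k) - 1})^n$, which is an inverse polynomial in $N$ since
$n = \log N$. Similarly, if $h < n$, we may enlarge the set to $n$ edges on the same level,
so the chance of having more than $kn$ messages crossing the set is also less than
$(e^{k(1 - \ln k) - 1})^n$. ■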
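To get a feel for how quickly the tail bound $(e^{k(1-\ln k)-1})^h$ from the proof of lemma 2.1 decays, it can be evaluated numerically. The helper name `tail_bound` is hypothetical; the sketch only evaluates the closed-form expression and does not simulate the routing.

```python
import math


def tail_bound(k, h):
    """Evaluate (e^{k(1 - ln k) - 1})^h, the bound on Pr[X >= k*h] for k > 1."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return math.exp(h * (k * (1.0 - math.log(k)) - 1.0))


if __name__ == "__main__":
    # Tabulate the bound for a few constants k with h = 20 edges.
    for k in (2, 3, 4, 8):
        print(k, tail_bound(k, 20))
```

For $k = e$ the exponent simplifies to $-h$, which gives a convenient sanity check on the formula.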

Lemma 2.2. Take an arbitrary set of $O(n^3)$ edges in the $n$-dimensional butterfly.
With high probability the Valiant-Brebner routing scheme routes only $O(n^3)$ messages
through edges in the set, counting a message once for each time it traverses an edge
in the set (i.e. counting according to multiplicity).

Proof. We will examine each level separately and then sum across levels. Say level
$l$ has $e_l$ edges from the set. By lemma 2.1, for any $k$ there is a $c$ such that there is at
most an $N^{-k}$ chance that more than $c(e_l + n)$ messages traverse the $e_l$ edges from the
set at level $l$. Summing over all levels, with probability at least $1 - nN^{-k}$, the number
of messages crossing edges from the set at any level is no more than $c(\sum_l e_l + n^2)$. ■


2.2.2  Jump  Edges 

As  we  mentioned  earlier,  the  second  use  of  randomness  involves  evading  faults  which 
lie  on  the  virtual  path  chosen  by  the  Valiant-Brebner  routing  algorithm.  When  we 
route  on  the  hypercube,  we  have  access  to  many  more  edges  out  of  each  node  than 
we  do  when  we  route  on  the  butterfly.  We  can  use  these  edges  to  route  around 
faulty  components.  While  bits  are  changed  consecutively  by  traversing  virtual  paths, 
arbitrary bits are changed during fault avoidance. We create the butterfly with jump
edges to accentuate the changing of adjacent bits in the virtual path while allowing
for the changing of arbitrary bits in the offset path.

Figure 2-3: Jump edges. These edges form the hypercube connections for the nodes
on each level. The dashed edges are from the underlying butterfly.

A jump edge is an edge of the type $(v_j, v^i_j)$. Jump edges are not butterfly edges. A
packet traversing such an edge would be sent (in the hypercube) from the $j$th queue
of $v$ across the edge $(v, v^i)$ and deposited in the $j$th queue of $v^i$. Note that all $n$ jump
edges of the type $(v_j, v^i_j)$, $j$ varying, are actually manifestations of a single hypercube
edge from $v$ to $v^i$. This means that every hypercube edge is represented $n + 2$ times
in the butterfly with jump edges: as $n$ different jump edges and 2 cross edges. Figure
2-3 depicts the jump edges for the $3 \times 8$ butterfly.
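The count of $n + 2$ copies can be checked by listing them. In this sketch a butterfly node is written as a pair (bit tuple, level), dimensions are numbered 1 through $n$, and the helper names `flip` and `copies_of_hypercube_edge` are hypothetical.

```python
def flip(v, i):
    """Neighbor of v across dimension i (dimensions are numbered 1..n)."""
    return v[:i - 1] + (1 - v[i - 1],) + v[i:]


def copies_of_hypercube_edge(v, i):
    """List the edges of the butterfly with jump edges that represent (v, v^i)."""
    n = len(v)
    u = flip(v, i)
    # One jump edge per level j: same level, endpoints differ in bit i.
    jump = [((v, j), (u, j)) for j in range(n)]
    # Two cross edges between levels i-1 and i (levels taken mod n).
    cross = [
        ((v, (i - 1) % n), (u, i % n)),
        ((u, (i - 1) % n), (v, i % n)),
    ]
    return jump + cross


if __name__ == "__main__":
    for edge in copies_of_hypercube_edge((0, 0, 0), 3):
        print(edge)
```

For the $3 \times 8$ butterfly this lists three jump edges and two cross edges for each hypercube edge.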

If we collapse the levels of the butterfly with jump edges, we regain the hypercube.
Any algorithm we create for the butterfly with jump edges works as well on the
hypercube. We need only be especially careful about congestion, or multiple packets
crossing the same edge. A cross edge or jump edge traversed by a given packet is
actually one out of several appearances of a hypercube edge in the butterfly with
jump edges. Any congestion on another manifestation of the hypercube edge could
slow the packet down. Among other things, we will concern ourselves with the total
congestion on a hypercube edge traversed by a packet, not just the congestion on the
particular cross edge or jump edge it traverses.
2.2.3  Offset  Routing

In the offset routing algorithm, each packet remains fairly close to its Valiant-Brebner
path. A packet’s location always differs from where the Valiant-Brebner algorithm
would send it by some offset which is a random dimension. The offset routing
algorithm retains the two-phase structure of Valiant and Brebner’s algorithm.

We first describe how packets are routed from level to level in the butterfly with
jump edges. Recall that the path traversed by a packet in the Valiant-Brebner scheme
is its virtual path. In the offset routing algorithm, a packet whose virtual path would
pass through the $(k-1)$st level at the node $v_{k-1}$ will pass through the level at some
node $v^i_{k-1}$ instead. If its virtual path would leave $v_{k-1}$ via a straight edge, then the
offset path will traverse three edges of the type $(v^i_{k-1}, v^{ij}_{k-1}, v^{ij}_k, v^j_k)$. It finds such a
path by randomly choosing a dimension $j \ne i$ and attempting to route across the
appropriate three edges. If the packet encounters a fault in any of the three edges or
the nodes along those three edges, it returns to the node $v^i_{k-1}$, which chooses another
random dimension and tries again. Note that this means that a packet might have to
traverse many more than three edges to pass from one level to the next. If the virtual
path would leave $v_{k-1}$ via a cross edge, then the offset path traverses three edges of
the type $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$ instead. Note that no matter whether straight edges or
cross edges are used in the virtual path, the packet ends with a random offset $j$ from
its virtual location. If necessary, the $k$th bit is changed to agree with the $k$th bit of
the destination. Figure 2-4 presents an offset path between adjacent levels.
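One level-to-level hop of the offset routing scheme can be sketched on the collapsed hypercube as follows. This is an illustration under several assumptions that are not in the text: faults are given as a set of faulty nodes and a set of faulty edges (each edge a `frozenset` of two nodes), candidate offsets exclude both $i$ and $k$, retries are capped by a hypothetical `max_attempts` parameter, and the names `flip`, `path_is_live`, and `offset_hop` are invented here.

```python
import random


def flip(v, i):
    """Neighbor of v across dimension i (dimensions are numbered 1..n)."""
    return v[:i - 1] + (1 - v[i - 1],) + v[i:]


def path_is_live(nodes, faulty_nodes, faulty_edges):
    """True if every node after the first and every hypercube edge on the path works."""
    for a, b in zip(nodes, nodes[1:]):
        if a == b:
            continue  # straight butterfly edge: the packet stays in one hypercube node
        if b in faulty_nodes or frozenset((a, b)) in faulty_edges:
            return False
    return True


def offset_hop(v, i, k, cross, faulty_nodes, faulty_edges, rng, max_attempts=1000):
    """Move a packet held at v^i (virtual node v) across level k.

    Returns (new offset j, new physical node, attempts used), or None if no
    live path was found within max_attempts tries.
    """
    n = len(v)
    start = flip(v, i)
    choices = [j for j in range(1, n + 1) if j not in (i, k)]
    for attempt in range(1, max_attempts + 1):
        j = rng.choice(choices)
        a = flip(start, j)              # jump edge across dimension j
        b = flip(a, k) if cross else a  # cross or straight butterfly edge
        c = flip(b, i)                  # jump edge removing the old offset i
        if path_is_live([start, a, b, c], faulty_nodes, faulty_edges):
            return j, c, attempt
    return None


if __name__ == "__main__":
    v = (0, 0, 0, 0)
    faulty = {flip(flip(v, 1), 3)}  # blocks the hop that would use offset 3
    print(offset_hop(v, 1, 2, True, faulty, set(), random.Random(0)))
```

After a successful hop the packet sits at the new virtual node offset by $j$, which is the invariant the text describes.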

Each packet must choose an initial offset to leave its source and must remove
its final offset to reach its destination. To begin, the message generated by node
$v$ repeatedly chooses a random dimension $j$ and attempts to route across the edge
$(v_0, v^j_0)$ until it successfully finds an initial offset. Say that the message reaches the
0th level at the end of the second phase with offset $i$ (i.e. it reaches the node $w^i_0$).
Then to conclude, the message finds an offset $j$ for which the path $(w^i_0, w^{ij}_0, w^j_0, w_0)$ is
fault-free.

The offset routing algorithm combines Valiant and Brebner’s strategy of changing
adjacent bits with a means for avoiding faults. In our analysis, we will make
fundamental use of the property of the distribution of virtual paths proven in lemma
2.2. The even distribution of virtual paths will help to ensure the even distribution
of offset paths over the edges of the hypercube, assuming random offsets are chosen.

Figure 2-4: A virtual edge between adjacent levels (shown as a dashed line) and a
possible offset path (shaded). In this example, $i = 1$ and $j = 3$.

2.2.4  The  Length  of  Offset  Paths 

If a packet is to arrive at its destination within $O(\log N)$ steps, certainly the path
it takes must have length $O(\log N)$. In Valiant and Brebner’s algorithm, the length
of paths is fixed at $2 \log N$. Offset paths are of variable length, depending on faults
encountered along the way. In this section we describe the condition of local
routability. We prove that if the hypercube is locally routable, then with high probability all
packets traverse offset paths of length $O(\log N)$.

Essentially, a hypercube is locally routable if every node always has ample
opportunity to send a packet to the next level in the butterfly with jump edges. Consider
a path $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$. We assume that a message has successfully arrived at
$v^i_{k-1}$ and so there are six components (three nodes and three edges) in the path that
must all work properly. If the probability of failure is $p < 1 - \sqrt[6]{1/2}$ (about 0.11) and
the faults are independent, then each such path has probability $p' = 1 - (1 - p)^6 < 1/2$
that it has a faulty component. For subsequent analysis, we would like it to be the
case that for all pairs $v_{k-1}, i$ there are at least a constant fraction of offset
dimensions $j$ which lead the message on a functioning path $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$. We would
also like to know that at least a constant fraction of the paths $(v^i_{k-1}, v^{ij}_{k-1}, v^{ij}_k, v^j_k)$
are fault-free for all pairs $v_{k-1}, i$. To begin the routing, we need that for all nodes
$v_0$, a constant fraction of the edges $(v_0, v^j_0)$ function properly. To end the routing,
we need that for all pairs $v_0, i$, a constant fraction of offset dimensions $j$ lead the
message on a functioning path $(v^i_0, v^{ij}_0, v^j_0, v_0)$. Define the following sets of paths:
$P_{v_{k-1},i} = \{(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k),\ j \text{ varying}\}$,
$P'_{v_{k-1},i} = \{(v^i_{k-1}, v^{ij}_{k-1}, v^{ij}_k, v^j_k),\ j \text{ varying}\}$,
$Q_{v_0} = \{(v_0, v^j_0),\ j \text{ varying}\}$ and $Q'_{v_0,i} = \{(v^i_0, v^{ij}_0, v^j_0, v_0),\ j \text{ varying}\}$.
Fix an $\epsilon_p > 0$. If all possible sets $P_{v_{k-1},i}$, $P'_{v_{k-1},i}$, $Q_{v_0}$ and $Q'_{v_0,i}$ contain at least
$\epsilon_p n$ fault-free paths, we say the butterfly is locally routable.
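The threshold on $p$ comes from requiring that a six-component path be more likely to work than to fail. A short numeric check of that arithmetic, with the hypothetical names `path_failure_probability` and `P_MAX`:

```python
def path_failure_probability(p, components=6):
    """Probability that at least one of `components` independent parts has failed."""
    return 1.0 - (1.0 - p) ** components


# Largest failure rate for which a six-component path fails with probability 1/2.
P_MAX = 1.0 - 0.5 ** (1.0 / 6.0)

if __name__ == "__main__":
    print("threshold on p:", P_MAX)
    for p in (0.05, 0.10, P_MAX, 0.12):
        print(p, path_failure_probability(p))
```

Any $p$ strictly below `P_MAX` gives a path failure probability strictly below 1/2, which is the property the later analysis relies on.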

Lemma 2.3. Assume that the probability that any component fails is less than
$1 - \sqrt[6]{1/2}$ and that all failures occur independently. Then there exist a sufficiently small
$\epsilon_p > 0$ and $c_1 = c_1(\epsilon_p, p)$ such that with probability at least $1 - N^{-c_1}$ the butterfly
is locally routable.

Proof. The set of paths $P_{v_{k-1},i}$ available at $v^i_{k-1}$ are node-disjoint. (The same
argument holds for sets of paths $P'$, $Q$ and $Q'$.) Thus the faultiness of any path is
independent of other paths in the set.

The probability that fewer than $\epsilon_p n$ paths are fault-free is at most

\[ \sum_{j=0}^{\epsilon_p n} \binom{n}{j} (1 - p')^j (p')^{n-j}. \]

The ratio of successive terms is $\frac{n-j+1}{j} \cdot \frac{1-p'}{p'}$, which is greater than 1 and bounded
away from 1 for small enough $\epsilon_p$. Thus the sum is bounded by a constant times the
last term. Let $\exp_2(x)$ denote $2^x$. Then the last term is

\begin{align*}
\binom{n}{\epsilon_p n} (1 - p')^{\epsilon_p n} (p')^{(1 - \epsilon_p) n}
  &\le \exp_2\left(\epsilon_p n \log e - \epsilon_p n \log \epsilon_p + \epsilon_p n \log(1 - p') - \epsilon_p n \log p' + n \log p'\right) \\
  &= \exp_2\left(h(\epsilon_p, p') n + n \log p'\right),
\end{align*}

where $h(\epsilon_p, p') = \epsilon_p (\log e - \log \epsilon_p + \log(1 - p') - \log p')$.

We can bound this last expression using the fact that each path has a low
probability of containing a failure. Since $p' < 1/2$, $\log p' = -1 - c_2$ for some $c_2 > 0$. For
$\epsilon_p = 0$, $h(\epsilon_p, p') = 0$ and the above expression equals $N^{-1-c_2}$. Since $h(\epsilon_p, p')$ is
continuous, there is an $\epsilon_p > 0$ such that the above expression is bounded by $N^{-1-c_3}$ with
$c_3 > 0$. Since there are only $O(N \log^2 N)$ pairs $v_{k-1}, i$, any $c_1 < c_3$ makes the lemma
true. ■


With probability $N^{-c_4}$ for some fixed $c_4$, some node has only faulty neighbors.
Thus we cannot strengthen lemma 2.3. For the remainder of section 2.2, we assume
the butterfly is locally routable. Under this assumption, we will prove that the
algorithm succeeds quickly with high probability.


Lemma 2.4. Say a butterfly has faulty components but is locally routable. With
high probability each message in the offset routing traverses a path of length $O(n)$.


Proof. We will prove that any given message’s path is of length $O(n)$ with high
probability. Since there are only $N$ messages, this will imply the lemma. Assume
that at some point in its route, the packet is at the node $v^i_{k-1}$, where $v_{k-1}$ is the node
it would traverse in the Valiant-Brebner scheme. Assume as well that the packet is
scheduled to traverse dimension $k$. (If the straight edge is to be used or if the packet
is at the beginning or end of the route, the analysis is identical.) Then if the packet
successfully chooses to jump across dimension $j$, the path $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$ must
have no faults. Since the butterfly is locally routable, at least $\epsilon_p n$ of the possible paths to
choose are fault-free. If a faulty path is chosen, the packet encounters the fault and
returns to $v^i_{k-1}$ using no more than six edges. Since a random dimension is chosen at
each step, the probability that a packet takes more than $6b(2n + 2)$ steps is less than
the probability of at least $(b - 1)(2n + 2)$ heads in a sequence of $b(2n + 2)$ tosses of a
coin with probability $\epsilon_p$ of landing tails. This probability is less than
an inverse polynomial in $N$ for large enough $b$. ■

Figure 2-5: One packet might delay another packet’s progress even if they never cross
paths in the butterfly with jump edges. One hypercube edge is replicated $n + 2$ times
in the butterfly with jump edges. The two sections of paths shown here intersect in
the hypercube because the darkened edges are actually one edge in the cube.
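The coin-tossing bound at the end of the proof of lemma 2.4 can be evaluated exactly for small parameters. In this sketch, heads stands for a failed attempt (probability $1 - \epsilon_p$) and tails for a successful one (probability $\epsilon_p$); the function name `prob_too_many_failures` and the sample values of $n$, $b$, and $\epsilon_p$ are illustrative assumptions, not values from the thesis.

```python
from math import comb


def prob_too_many_failures(n, b, eps):
    """Pr[at least (b-1)(2n+2) heads in b(2n+2) tosses], heads having probability 1-eps."""
    m = 2 * n + 2          # number of hops a packet must complete
    tosses = b * m
    q = 1.0 - eps          # probability that a single attempt fails
    return sum(
        comb(tosses, heads) * q ** heads * eps ** (tosses - heads)
        for heads in range((b - 1) * m, tosses + 1)
    )


if __name__ == "__main__":
    # With eps = 0.5 and n = 4 the tail shrinks quickly as b grows.
    for b in (2, 4, 8):
        print(b, prob_too_many_failures(4, b, 0.5))
```

The tail only starts to decay once $(b-1)/b$ exceeds $1 - \epsilon_p$, which is one way to see why $b$ must be a sufficiently large constant.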


2.2.5  Delay  From  Other  Packets 

Now that we know each message moves a distance of $O(\log N)$ during an offset routing
phase, we need to show that its forward movement is delayed by at most $O(\log N)$
other packets. These facts together will bound the packet’s time to its destination.
We will show that few other packets choose virtual paths in such a way that they
have a non-zero probability of selecting an offset path which congests a given node’s
path. We will then show that even fewer of those actually congest the path when
they use offset paths.

Recall  that  a  cross  edge  or  jump  edge  traversed  by  a  given  packet  is  actually  one 
out  of  several  appearances  of  a  hypercube  edge  in  the  butterfly  with  jump  edges.  Any 
congestion  on  another  manifestation  of  the  hypercube  edge  will  slow  the  packet  down. 
Therefore  we  group  all  n  +  2  copies  of  the  edge  together  and  refer  to  the  group  as  one 
hypercube  edge. 
Lemma 2.5. Consider a set $E$ of $O(n)$ hypercube edges and butterfly straight edges.
Let $S$ be the set of butterfly edges such that any packet whose virtual path crosses
an edge in $S$ has a non-zero probability of congesting an edge in $E$ as a butterfly edge
in its offset path. Then with high probability, there are $O(n^3)$ packets whose virtual
paths traverse any of the edges in $S$, counting a packet several times if it traverses
several edges in $S$.

Proof. If $(w_{l-1}, w^l_l)$ is a butterfly edge traversed by a packet’s offset path then the
packet’s virtual path must use an edge of the form $(w^{ij}_{l-1}, w^{ijl}_l)$ for some pair $i, j$. There
are only $n^2$ such pairs. The same reasoning would hold if the edge in question were
a straight edge. Since $|E| = O(n)$ and each edge of $E$ accounts for at most $n^2$ edges
of $S$, $|S| = O(n^3)$. By lemma 2.2, only $O(n^3)$ packets traverse edges in $S$, with high
probability. ■

Lemma 2.6. Let $T$ be the set of butterfly edges such that any packet whose virtual
path crosses an edge in $T$ has a non-zero probability of congesting some edge in
$E$ as a jump edge in its offset path. Then with high probability, there are $O(n^3)$
packets whose virtual paths traverse any of the edges in $T$, again counting according
to multiplicity.

Proof. Say $(w_{k-1}, w^l_{k-1})$ is a jump edge traversed by a packet. Let
$(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$ or $(v^i_{k-1}, v^{ij}_{k-1}, v^{ij}_k, v^j_k)$ be the path used by the packet when
it traverses the jump edge. Then $(w, w^l)$ is either the first or the last edge traversed
in the path. If it is the first, then $w_{k-1} = v^i_{k-1}$, therefore $l = j$. The edge traversed in
the virtual path would have been $(v_{k-1}, v^k_k)$ or $(v_{k-1}, v_k)$ for some $k$. There are $n$
choices for $v$ such that $v^i_{k-1} = w_{k-1}$ and $n$ choices for $k$. Thus there are only $O(n^2)$
elements of $T$ whose traversal in some packet’s virtual path gives the packet a non-zero
probability of traversing the edge $(w, w^l)$ as a jump edge. The same reasoning holds
for use of the jump edge as a third edge. Again, since $|E| = O(n)$, $|T| = O(n^3)$. By
lemma 2.2, only $O(n^3)$ packets traverse edges in $T$, with high probability. ■

Lemmas 2.5 and 2.6 also hold for the set of edges incident to the set of nodes
$\{v_k\}$ for some hypercube node $v$. If we bound the number of packets congesting these
edges then we bound the number of packets ever residing in queues in the node $v$ (the
queue size of $v$).

We would like to bound the number of packets which congest the path $p_0$ of
some message $m_0$. There are $O(n^3)$ packets with a nonzero probability of congesting
some edge along $p_0$. Focus on one of these packets $m_r$. The packet $m_r$ will cause
congestion along the path $p_0$ only if an unfortunate pair of offsets $i$ and $j$ are chosen
for it. When the packet traverses from level $l_r$ to the level $l_r + 1$, its offset $i$ at the
$l_r$th level is inherited from a choice made by some node in the $(l_r - 1)$st level. Then the
node at level $l_r$ chooses an offset $j$ which will route the packet $m_r$ to the $(l_r + 1)$st level.
Because any given fault can affect the routes of several different packets, offset choices
made by different packets are dependent. However, the packet $m_r$ is guaranteed by
local routability to have $\sigma(r) \ge \epsilon_p n$ choices of offset $i$ available at level $l_r - 1$. One
of those offsets, say $i_s$, will be chosen uniformly to route the packet $m_r$ to level $l_r$.
Once there, some number $a_{rs}$ of the offsets $j$ will cause the packet $m_r$ to cross the
path $p_0$, if the packet is routed to level $l_r + 1$ using one of those $a_{rs}$ offsets. Since
we wish to minimize the probability that such congestion occurs too often, we are
concerned that choices for $i_s$ are made which leave too many unfortunate choices of
$j$. By lemmas 2.5 and 2.6, we know that there exists a constant $d$ such that with high
probability $\sum_{r,s} a_{rs} \le dn^3$. This follows because summing the $a_{rs}$ is a second way to
count virtual paths traversing edges in $S$ and $T$, counting a path once for each time
it traverses an edge in $S$ or $T$. Finally, each $a_{rs}$ is at most $n$, since there will be a
total of $n$ offsets $j$ from which to choose. In the next technical lemma, we use these
bounds on the $a_{rs}$, or number of unfortunate choices of offset $j$, to bound the number
of bad choices of $j$’s left once all packets have had the offsets $i$ chosen for them.

Lemma 2.7. Consider a family of nonnegative integers $\{a_{rs} \mid 1 \le r \le z,\ 1 \le s \le \sigma(r)\}$
where $\sigma(r) \ge \epsilon_p n$ for all $r$, $\sum_{r,s} a_{rs} \le dn^3$ and $a_{rs} \le n$ for all pairs $r, s$. If
exactly one index $s_r$ is chosen uniformly in $[1, \sigma(r)]$ for each index $r$ then with high
probability $\sum_r a_{rs_r} = O(n^2)$.

Proof. Let $X_r = \alpha_{r s_r}$. A picture of the choice of the $X_r$ appears in figure 2-6.
We wish to bound the value of $X = \sum_r X_r$. As in lemma 2.1, we bound the
moment generating function $M(\lambda) = E[e^{\lambda X}]$ and then we bound
$\Pr[X \ge bn^2] = \Pr[e^{\lambda X} \ge e^{\lambda b n^2}] \le e^{-\lambda b n^2} M(\lambda)$. As before, we will first bound the
moment generating functions $M_r(\lambda) = E[e^{\lambda X_r}]$. Again, since the $X_r$ are independent,

$$M(\lambda) = \prod_r M_r(\lambda).$$

Figure 2-6: The $\alpha_{rs}$ and a possible choice of $X_r$. Each row represents the choice
for some packet. The entry $\alpha_{rs}$ counts the number of offsets $j$ which, if chosen in
conjunction with the offset $i_s$, would cause the packet to congest the path $p_0$.
Circled entries represent the selections from each row.

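If we could find $\alpha_{rx} \le \alpha_{ry}$ and a positive integer $\delta$ such that $0 \le \alpha_{rx} - \delta$ and
$\alpha_{ry} + \delta \le n$, then by transferring $\delta$ units from the smaller $\alpha_{rx}$ to the larger $\alpha_{ry}$ we
could only increase $M_r(\lambda)$ (for positive $\lambda$). This follows because

$$e^{\lambda\alpha_{rx}} - e^{\lambda(\alpha_{rx}-\delta)} = e^{\lambda(\alpha_{rx}-\delta)}(e^{\lambda\delta} - 1) < e^{\lambda\alpha_{ry}}(e^{\lambda\delta} - 1) = e^{\lambda(\alpha_{ry}+\delta)} - e^{\lambda\alpha_{ry}}.$$

The resultant change in $M_r(\lambda)$,
$\frac{1}{\sigma(r)}\big[(e^{\lambda(\alpha_{ry}+\delta)} - e^{\lambda\alpha_{ry}}) - (e^{\lambda\alpha_{rx}} - e^{\lambda(\alpha_{rx}-\delta)})\big]$,
would be strictly positive. By this reasoning, if $A_r = \sum_s \alpha_{rs}$ is
fixed, we maximize $M_r(\lambda)$ by setting all terms except possibly one equal to either 0
or $n$. Thus

$$M_r(\lambda) \le \begin{cases}
\frac{1}{\sigma(r)}\left(e^{\lambda A_r} + \sigma(r) - 1\right) & \text{if } A_r < n \\
\frac{1}{\sigma(r)}\left(\lceil A_r/n \rceil e^{\lambda n} + \sigma(r) - \lceil A_r/n \rceil\right) & \text{if } A_r \ge n
\end{cases}$$

For the rest of the proof we fix $\lambda = 1/n$. If $A_r < n$ then
$M_r(\lambda) \le \frac{1}{\sigma(r)}(e^{A_r/n} + \sigma(r) - 1) \le \frac{1}{\sigma(r)}(1 + \frac{2A_r}{n} + \sigma(r) - 1) \le 1 + \frac{2A_r}{\epsilon_p n^2}$.
(The second inequality uses the fact that for $0 \le \gamma \le 1$, $e^{\gamma} \le 1 + 2\gamma$.) If $A_r \ge n$ then
$M_r(\lambda) \le \frac{1}{\sigma(r)}(\frac{2A_r}{n} e + \sigma(r)) \le 1 + \frac{2eA_r}{\epsilon_p n^2}$.
In either case the bound is at most $1 + \frac{2eA_r}{\epsilon_p n^2} \le \exp(\frac{2eA_r}{\epsilon_p n^2})$. Thus
$M(\lambda) \le \prod_r \exp(\frac{2eA_r}{\epsilon_p n^2}) \le e^{2edn/\epsilon_p}$.
Continuing the reasoning of the first paragraph of the proof,
$\Pr[X \ge bn^2] \le e^{-bn} e^{2edn/\epsilon_p}$. We can make this probability an arbitrarily large
negative power of $N$ by letting $b$ be a large constant. ■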
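
As a sanity check on the order of magnitude (this is not part of the original proof), the expectation of $X$ can be computed directly. The sketch below assumes the hypotheses read $\sigma(r) \ge \epsilon_p n$ and $\sum_{r,s} \alpha_{rs} \le dn^3$, as they are used in the proof of theorem 2.9, and writes $A_r = \sum_s \alpha_{rs}$ for the row sums:

```latex
\mathbb{E}[X]
  = \sum_r \frac{1}{\sigma(r)} \sum_{s=1}^{\sigma(r)} \alpha_{rs}
  = \sum_r \frac{A_r}{\sigma(r)}
  \le \frac{1}{\epsilon_p n} \sum_r A_r
  \le \frac{d n^3}{\epsilon_p n}
  = \frac{d}{\epsilon_p}\, n^2 .
```

Under these assumptions the lemma says that $X$ exceeds a constant multiple of its mean only with very small probability.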

Each packet $m_r$ may take several attempts before reaching level $l_r + 1$ safely. On
each attempt, the packet $m_r$ may congest the path $p_0$. The packet always has at
least an $\epsilon_p$ chance that it will make it to level $l_r + 1$ on any given attempt. Thus the
number of trials it requires to succeed will be distributed somewhat like the geometric
distribution with parameter $\epsilon_p$. For ease of notation, set $\beta_r = \alpha_{r s_r}$ from the previous
lemma. In each attempt by the packet $m_r$, there are $\beta_r$ choices of offset which will
produce congestion. The following technical lemma will help bound these multiple
contributions.

Lemma 2.8. Consider a family of nonnegative integers $\{\beta_r \mid 1 \le r \le z\}$ where
$\sum_r \beta_r = O(n^2)$ and $\beta_r = O(n)$ for all $r$. Let $\{g_r\}$ be a set of random variables
with geometric distributions $g_r \sim G(\epsilon_p)$ (i.e. $g_r = a$ with probability $\epsilon_p(1 - \epsilon_p)^{a-1}$).
Then with high probability, $\sum_r g_r \beta_r = O(n^2)$.

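Proof. Order the integers by increasing size $\beta_1 \le \beta_2 \le \ldots \le \beta_z$. Then since

$$\beta_{kn+1}, \ldots, \beta_{(k+1)n-1}$$

are all at least as large as $\beta_{kn}$, we know that $\sum_k \beta_{kn} = O(n)$ (each block of $n$
consecutive terms contributes at least $n\beta_{kn}$ to $\sum_r \beta_r = O(n^2)$). We assume that
$\beta_z = O(n)$, so the sum $\beta_z + \sum_k \beta_{kn} = O(n)$.

Now with high probability, all sums $\sum_{r=1}^{n} g_{kn+r}$ are $O(n)$. We know that

$$\sum_r g_r \beta_r \le \sum_k \Big(\sum_{r=1}^{n} g_{kn+r}\Big) \beta_{(k+1)n}.$$

Thus, with high probability, $\sum_r g_r \beta_r = O(n^2)$. ■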
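
The statement of lemma 2.8 can be explored numerically. The following is a minimal simulation sketch, not the thesis's argument: it assumes the $g_r$ are independent, picks one admissible family ($\beta_r = n$ for $n$ indices, so $\sum_r \beta_r = n^2$), and reports the largest observed ratio $\sum_r g_r\beta_r / n^2$. The function names and parameter values are illustrative only.

```python
import random

def sample_geometric(eps, rng):
    """Number of independent trials up to and including the first success."""
    trials = 1
    while rng.random() >= eps:
        trials += 1
    return trials

def weighted_geometric_sum(betas, eps, rng):
    """Return sum_r g_r * beta_r with independent g_r ~ G(eps)."""
    return sum(sample_geometric(eps, rng) * beta for beta in betas)

def worst_ratio(n, eps, runs, seed=0):
    """Largest observed value of (sum_r g_r beta_r) / n^2 over several runs."""
    rng = random.Random(seed)
    betas = [n] * n  # assumed example family: beta_r = n, so sum beta_r = n^2
    return max(weighted_geometric_sum(betas, eps, rng) for _ in range(runs)) / n**2

if __name__ == "__main__":
    print(worst_ratio(n=64, eps=0.5, runs=20))
```

Since every $g_r$ is at least 1, the ratio is never below 1; the lemma's content is that it stays bounded by a constant with high probability.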

Theorem 2.9. If we route using offset routing and the hypercube is locally routable,
then with high probability, all packets are delivered in $O(\log N)$ steps and all nodes
have total queue size $O(\log N)$.

Proof. Focus on the path $p_0$ of a particular message $m_0$. We will show that the
congestion along $p_0$ from various sources is $O(n)$ with high probability.

Lemmas 2.5 and 2.6 bound the number of messages which have the potential to
congest an edge of $m_0$'s path while passing between levels on their own paths.
Enumerate the packets $m_1, m_2, \ldots, m_z$ which have a non-zero probability of congesting
$p_0$ while traversing an edge from an even level to an odd level in their virtual paths.
A particular packet may appear several times in the enumeration, once for each even
level node along its virtual path from which it might congest an edge of $p_0$.

The packet $m_r$ has at least $\epsilon_p n$ paths which would successfully route it to the next
level. Arbitrarily designate exactly $\epsilon_p n$ of these paths as special. For the purposes
of our analysis, we require $m_r$ to choose a special path before we allow it to route to
the next level. This can only increase the amount of congestion placed on any edge,
since it increases the number of attempts made by each packet. However, once $m_r$
does choose a special path, we always place it in the last node of the first fault-free
path it found. Thus $m_r$ winds up in the same place on the next level as if no special
requirements had been made.

Consider the choice of offsets made by the message $m_r$ at even level $l_r$. Let $q_r$ be
the number of choices of pairs of offset dimensions $(i,j)$ for the message $m_r$ which
would congest an edge in $m_0$'s path. Then $\sum_r q_r = O(n^3)$ by lemmas 2.5 and 2.6. (As
described in the discussion immediately preceding lemma 2.7, $\sum_r q_r$ is a second way
to count the number of edges in $S$ and $T$ according to multiplicity.)

The choice of the dimension $i$ was actually made for $m_r$ at level $l_r - 1$. The choice
was made randomly and uniformly from the set of offsets which led to a fault-free
path to level $l_r$. The exact selection of offsets $i$ is dependent from packet to packet
and, for a particular packet, from one level to the next. However, no matter how
we condition on previous events, there are always enough offsets to choose from at
any given moment. Also, the bounds on the probabilities of congesting $p_0$ will hold
regardless of previous events. Let $i_1 < i_2 < \ldots < i_{\sigma(r)}$, $\sigma(r) \ge \epsilon_p n$, be the choices
of offsets at level $l_r - 1$ which lead to a fault-free path to level $l_r$. Let $\alpha_{rs}$ equal the
number of offsets $j$ such that if $m_r$ is routed from level $l_r - 1$ to level $l_r$ using offset
$i_s$ and next to level $l_r + 1$ using offset $j$, then congestion results in $m_0$'s path. Then
since $\sum_s \alpha_{rs} = q_r$, $\sum_{r,s} \alpha_{rs} = O(n^3)$. Since the total number of offsets $j$ is $n$, clearly
$\alpha_{rs} \le n$. Let $i_{s_r}$ be the offset for $m_r$ actually chosen at level $l_r - 1$. Lemma 2.7
implies that with high probability $\sum_r \alpha_{r s_r} = O(n^2)$. For convenience of notation, set
$\beta_r = \alpha_{r s_r}$.

At level $l_r$, whether the message $m_r$ chooses a path from the set of $\epsilon_p n$ special
paths or the set of $(1 - \epsilon_p)n$ nonspecial paths, it has at most $\beta_r$ choices which congest
$m_0$'s path. Thus whether we condition on the choice being special or nonspecial,
the probability that message $m_r$ will congest $m_0$'s path is bounded by $O(\beta_r/n)$.

Now that we have bounded the probability that the packet $m_r$ will congest the
path $p_0$ during one of its attempts to route to level $l_r + 1$, we can bound the probability
that too many packets actually congest $p_0$. The number of routing attempts made by
$m_r$ is $g_r \sim G(\epsilon_p)$. On each attempt, the probability that $m_r$ will congest $m_0$'s path
is at most $O(\beta_r/n)$. Each attempt is an independent trial and the sum of the probabilities
of congestion in the trials is at most $O(\sum_r g_r \beta_r / n)$, which is $O(n)$ by lemma 2.8. By a
moment generating function argument identical to that in lemmas 2.1 and 2.7, with
high probability $O(n)$ attempts actually did congest $m_0$'s path. Since each attempt
involves at most six edges, each attempt can add at most six to the congestion on
$m_0$'s path. Thus with high probability, the total congestion on the path from routing
attempts at even levels is $O(n)$.

Next examine the congestion on $p_0$ from other packets beginning and ending their
paths. For a packet to congest an edge as the first jump edge of its path, it has
to be generated by one of the edge's endpoints. Thus there are at most $O(n)$ such
packets. Now consider those packets congesting $p_0$ during the ending of their paths.
Each of the three jump edges used to finish off a path has an endpoint which is at
distance one from the virtual destination. Thus at most $O(n)$ packets exist which
have the potential to congest any given edge as the first, second or third of these jump
edges. Therefore a total of $O(n^2)$ packets have a non-zero probability of congesting
some edge of $p_0$ as they finish their routes. An argument along the lines of the one
bounding congestion at even levels shows that congestion from these sources is $O(n)$
as well.


The  same  argument  bounds  congestion  from  routing  attempts  at  odd  levels,  and 
also  bounds  congestion  on  edges  incident  to  any  fixed  node.  ■ 


2.3  Information  Dispersal  Routing 

The  offset  routing  algorithm  cannot  tolerate  faults  which  occur  during  a  particular 
routing  phase.  If  a  packet  resides  in  a  node  as  it  fails,  that  packet  is  irretrievably  lost. 
Rabin  ([Rab])  discovered  how  to  use  the  technique  of  information  dispersal  to  route 
even  in  the  presence  of  failing  nodes,  provided  each  fault  occurs  with  probability  no 
more  than  0(l/n^). 

In this section we will present a simpler variation of Rabin's algorithm. We also
show how our algorithm handles faults occurring with probability $O(1/n)$. First, we
will briefly sketch the main ideas of the original routing algorithm. Each packet is
dispersed into $n$ pieces sent along node-disjoint paths to different locations and then
along node-disjoint paths to the final destination.

Since every piece needs to carry $\Omega(n)$ bits of routing information, the original
packets must necessarily be large. For concreteness we assume that all packets contain
$L = \Omega(n^2)$ bits. Any piece created will contain $O(L/n)$ bits. We also assume that all
links and nodes have the capacity to hold a constant number of the original packets
(and therefore $O(n)$ pieces).

Rabin proves that with high probability, the number of pieces crossing any node
or link never exceeds its capacity. No piece's progress is ever delayed by a full queue
in the node ahead. This guarantees that each piece can move during every step and
that the entire routing will take no more than $2(n + 1)$ steps: $n + 1$ steps for each
piece to arrive at its random intermediate location and another $n + 1$ to arrive at its
final destination.

As Rabin points out, routing with dispersal of information can tolerate faults if
the dispersal into pieces is done with more redundancy. The pieces may actually be
constructed in such a way that the arrival of half (or some other constant fraction)
of them is enough to reconstruct the original message. Rabin shows how to do this
through  matrix  multiplication.  He  then  proves  that  if  each  link  has  probability  1  /n^  of 
failure,  then  with  probability  1— 2iV(4c/n)"/^  all  messages  will  be  safely  reconstructed 
at  their  destinations. 

2.3.1  Routing  Along  Parallel  Paths 

Our  improvement  of  Rabin's  results  stems  from  a  more  uniform  and  efficient  selection 
of paths for the routing of pieces. The $n$ pieces are first sent to the neighbors
of  the  node  which  generated  the  packet.  These  pieces  are  then  routed  along  parallel 
paths  to  the  neighbors  of  a  random  intermediate  node.  From  there  the  pieces  are 
routed  along  parallel  paths  to  the  neighbors  of  the  intended  destination,  and  from 
there  to  the  destination  itself.  Except  for  the  dispersal  of  the  pieces  to  the  neighbors 
of  the  source  and  the  recovery  of  the  pieces  from  the  neighbors  of  the  destination,  the 
algorithm  can  be  viewed  as  a  butterfly  algorithm.  We  will  use  the  butterfly  for  our 
analysis.  A  picture  of  a  set  of  parallel  paths  appears  in  figure  2-7. 

If $v$ and $w$ are two hypercube nodes, let $\pi_i(v,w)$ be the path from $v^i$ to $w^i$ used in
one phase of the Valiant-Brebner scheme. Let $\Pi(v,w) = \{\pi_i(v,w) \mid 1 \le i \le n\}$ be the
set of all possible such paths. We will first show that if each node $v$ chooses a node
$v'$ uniformly and then routes a different piece along each of the $n$ paths in $\Pi(v, v')$
then only $O(n)$ pieces reside in any node's queue at any time step.
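
To make the set of parallel paths concrete, here is a small sketch. It rests on two assumptions that are not spelled out at this point in the text: that $v^i$ denotes the neighbor of $v$ across dimension $i$, and that one Valiant-Brebner phase corrects the address bits in increasing order of dimension. Dimensions are numbered from 0 in the code, and all function names are illustrative.

```python
from collections import Counter

def main_path(v, w, n):
    """Bit-fixing path from v to w: node j takes bits 0..j-1 from w, the rest from v."""
    path = []
    for j in range(n + 1):
        low = (1 << j) - 1
        path.append((w & low) | (v & ~low))
    return path

def parallel_paths(v, w, n):
    """Path i is the main path with bit i complemented in every node (assumed reading)."""
    base = main_path(v, w, n)
    return [[node ^ (1 << i) for node in base] for i in range(n)]

def max_crossings(v, w, n):
    """Largest number of distinct parallel paths passing through a single node."""
    seen = Counter()
    for path in parallel_paths(v, w, n):
        for node in set(path):
            seen[node] += 1
    return max(seen.values())

if __name__ == "__main__":
    for i, path in enumerate(parallel_paths(0b0000, 0b1011, 4)):
        print(i, [format(x, "04b") for x in path])
```

Under this reading, path $i$ starts at $v^i$, ends at $w^i$, and every step changes at most one bit. `max_crossings` can be used to check on small cubes the property stated later as lemma 2.12.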

Lemma 2.10. Consider the collection of all paths in the $N$ sets $\Pi(v,v')$ (varying
over $v$), where each hypercube node $v$ has chosen a node $v'$ randomly and uniformly.
For any node $u$ and any integer $0 \le j \le n$, with high probability $u$ is the $j$-th node
along only $O(n)$ paths in the collection.

Proof. If $u$ is the $j$-th node along the path $\pi_i(v,w)$ then $u^i = w_1 w_2 \ldots w_j v_{j+1} \ldots v_n$.
Separate the two cases in which either $i \le j$ or $i > j$. If $i \le j$, then it must be
that $v_{j+1} \ldots v_n = u_{j+1} \ldots u_n$. Precisely $2^j$ nodes satisfy this condition for $v$. If one of
these nodes chooses a $w$ such that $w_1 \ldots \bar{w}_i \ldots w_j = u_1 \ldots u_j$ for some $i \le j$,
then $u$ will be the $j$-th node along exactly one path $\pi_i(v,w)$. Otherwise, $u$ will be the
$j$-th node along none of the paths $\pi_i(v,w)$, $i \le j$. Thus for each of the $2^j$ nodes, the
probability of exactly one such path is $j/2^j$ and the probability of no such paths is
$1 - j/2^j$.

If $i > j$, then $v_{j+1} \ldots \bar{v}_i \ldots v_n = u_{j+1} \ldots u_n$ for some $i > j$. Precisely
$(n - j)2^j$ nodes satisfy this condition. All reasoning is the same as in the previous
case, except now $w$ must be chosen so that $w_1 \ldots w_j = u_1 \ldots u_j$. Thus the probability
that $u$ is the $j$-th node along exactly one such path is $1/2^j$. The probability that no
path $\pi_i(v, w)$, $i > j$ crosses $u$ in this fashion is $1 - 1/2^j$.

Figure 2-7: A path (dashed lines) and its adjacent set of parallel paths (shaded).

We now need only consider the sum of $2^j$ 0-1 random variables each with probability
$j/2^j$ of equalling 1 and $(n - j)2^j$ 0-1 random variables each with probability $1/2^j$
of equalling 1. Call this sum $X$. Then the moment generating function $M(\lambda)$ for $X$
satisfies

$$M(\lambda) = \Big(1 + \frac{j}{2^j}(e^{\lambda} - 1)\Big)^{2^j} \Big(1 + \frac{1}{2^j}(e^{\lambda} - 1)\Big)^{(n-j)2^j} \le e^{j(e^{\lambda}-1)}\, e^{(n-j)(e^{\lambda}-1)} = e^{n(e^{\lambda}-1)}.$$

Thus $\Pr[X \ge bn] \le e^{-\lambda bn} M(\lambda) \le (e^{e^{\lambda} - 1 - \lambda b})^n$. Setting $\lambda = \ln b$, this implies
$\Pr[X \ge bn] \le (e^{b-1}/b^{b})^n$, an inverse polynomial in $N$ whose exponent can be made as
small as desired by increasing the constant $b$. ■

The $i$-th piece created from $v$'s packet is sent to $v^i$, along the path $\pi_i(v,w)$ to $w^i$
and then to $w$. By lemma 2.10, at no time do more than $bn$ pieces cross a given
hypercube node, with high probability. Since the packets traversing any link all come
from one of the link's endpoints, no more than $2bn$ pieces cross the link during any
step of the routing. If all links and nodes have the capacity to hold $2b$ original packets,
then with high probability no buffering is necessary and no piece waits in a queue.
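
The counting in lemma 2.10 can be checked empirically on a small cube. The sketch below assumes the same reading of the parallel paths as above (bits corrected in increasing order of dimension, path $i$ carrying bit $i$ complemented throughout, dimensions numbered from 0) and counts, for every node $u$ and position $j$, how many paths have $u$ as their $j$-th node when each source picks a uniformly random destination. The function name and the seed are illustrative.

```python
import random
from collections import Counter

def position_counts(n, seed=0):
    """counts[j][u] = number of parallel paths whose j-th node is u."""
    rng = random.Random(seed)
    size = 1 << n
    counts = [Counter() for _ in range(n + 1)]
    for v in range(size):
        w = rng.randrange(size)          # random destination chosen by source v
        for j in range(n + 1):
            low = (1 << j) - 1
            base = (w & low) | (v & ~low)  # j-th node of the main path
            for i in range(n):
                counts[j][base ^ (1 << i)] += 1
    return counts

if __name__ == "__main__":
    n = 8
    counts = position_counts(n)
    print("average per node:", n, "maximum:", max(max(c.values()) for c in counts))
```

Each position $j$ carries exactly $nN$ path-node incidences, so the average count per node is exactly $n$; the lemma's claim is that the maximum stays within a constant factor $b$ of that average with high probability.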

This  analysis  assumes  that  each  node  routes  its  packet  to  a  random  destination. 
If  we  use  two  phases  as  in  the  Valiant-Brebner  scheme,  the  results  extend  to  arbitrary 
permutation  routings: 

Theorem 2.11. If all packets are divided into $n$ pieces which are routed along parallel
paths in both phases of the routing algorithm, then for an arbitrary permutation, with
high probability the two-phase routing takes $2(n + 1)$ steps. No piece waits at any
time. ■


2.3.2  Fault-Tolerant  Encoding  of  Pieces 

By giving the pieces more structure, we can make the information dispersal
fault-tolerant. We partition each packet $F$ into $n$ $O(L/n)$-bit pieces, but in such a way
that if any $m = n/2$ of the pieces arrive at the destination, the original packet may
be reconstructed.

Matrix multiplication is used to encode and decode the pieces. We need an $n \times m$
matrix $A$ every $m$ rows of which are linearly independent. We use the Hilbert matrix
$A_{ij} = 1/(x_i + y_j)$, where $x_i \ne x_{i'}$ $\forall i \ne i'$, $y_j \ne y_{j'}$ $\forall j \ne j'$ and $x_i + y_j \ne 0$ $\forall i, j$. For
all these distinctness conditions to hold we need a large field. We will use the field
$GF(2^s)$, with $s \approx \log\log N = \log n$.

Let $A'$ be the matrix formed by rows $i_1, i_2, \ldots, i_m$ of $A$. Then

$$|A'| = \frac{\prod_{k<l}(x_{i_k} - x_{i_l})(y_k - y_l)}{\prod_{k,l}(x_{i_k} + y_l)}.$$

Using this identity and Cramer's rule, we can invert any $m$ rows of $A$ in 0(m*s*)
steps.

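To take advantage of the Hilbert matrix $A$ we block the bits of the message $F$
into a matrix over the field $GF(2^s)$. Write $F = b_1 \ldots b_l$, where $l = L/s$ and each $b_i$ is
an $s$-bit byte interpreted as an element of $GF(2^s)$. Group the $b_i$'s into $l/m$ columns
of $m$ bytes each, and call this matrix $B$. Each source $u$ computes $F_1, \ldots, F_n$ as the
rows of the matrix product $AB$:

$$\begin{pmatrix}
a_{11} & \cdots & a_{1m} \\
a_{21} & \cdots & a_{2m} \\
\vdots & & \vdots \\
a_{n1} & \cdots & a_{nm}
\end{pmatrix}
\begin{pmatrix}
b_1 & b_{m+1} & \cdots & b_{l-m+1} \\
b_2 & b_{m+2} & \cdots & b_{l-m+2} \\
\vdots & \vdots & & \vdots \\
b_m & b_{2m} & \cdots & b_l
\end{pmatrix}
=
\begin{pmatrix}
F_1 \\ F_2 \\ \vdots \\ F_n
\end{pmatrix}$$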

Given  m  pieces  (rows  F),  the  destination  w  can  reconstruct  B  (i.e.  the  packet  F) 
since  the  corresponding  m  rows  of  A  are  linearly  independent.  The  destination  just 
inverts  the  matrix  containing  those  m  rows. 
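
The encoding and decoding steps can be illustrated with a short sketch over $GF(2^8)$. Several choices here are assumptions made for the example rather than the thesis's: the field size $s = 8$ with the reduction polynomial $x^8 + x^4 + x^3 + x + 1$, the values $x_i = i$ and $y_j = n + j$, and decoding by Gauss-Jordan elimination in place of the determinant identity and Cramer's rule. A matrix of the form $1/(x_i + y_j)$ is also commonly called a Cauchy matrix.

```python
from itertools import combinations

POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^8): carry-less product reduced by POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result

def gf_inv(a):
    """Inverse of a nonzero element, computed as a^(2^8 - 2)."""
    result, power, e = 1, a, 254
    while e:
        if e & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        e >>= 1
    return result

def cauchy_matrix(n, m):
    """A[i][j] = 1/(x_i + y_j) with x_i = i, y_j = n + j (addition is XOR)."""
    return [[gf_inv(i ^ (n + j)) for j in range(m)] for i in range(n)]

def encode(A, B):
    """Pieces are the rows of the product AB over GF(2^8)."""
    pieces = []
    for row in A:
        piece = []
        for c in range(len(B[0])):
            acc = 0
            for k, a in enumerate(row):
                acc ^= gf_mul(a, B[k][c])
            piece.append(acc)
        pieces.append(piece)
    return pieces

def decode(A, received, m):
    """Recover B from m pieces given as {row_index: piece} (Gauss-Jordan)."""
    rows = sorted(received)[:m]
    aug = [A[i][:] + received[i][:] for i in rows]
    for col in range(m):
        pivot = next(r for r in range(col, m) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, x) for x in aug[col]]
        for r in range(m):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [x ^ gf_mul(f, y) for x, y in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]

def every_subset_decodes(n, m, B):
    """Check that each choice of m surviving pieces reconstructs B."""
    A = cauchy_matrix(n, m)
    pieces = encode(A, B)
    return all(decode(A, {i: pieces[i] for i in rows}, m) == B
               for rows in combinations(range(n), m))
```

For instance, `every_subset_decodes(8, 4, B)` with any 4-row byte matrix `B` exercises all 70 ways in which half of the 8 pieces can survive.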

Note that it takes $O(mn \cdot l/m) = O(nl)$ word operations to multiply these matrices,
or $O(nls \log s)$ bit operations. The routing itself will take only $O(nls)$ bit operations,
so there exists a $\log s = \log\log\log N$ gap between the complexity of the encoding and
the routing stages of the protocol.

2.3.3  Fault-Tolerance  via  Parallel  Paths 

If  we  encode  the  original  packet  in  the  pieces  via  Rabin’s  matrix  multiplication,  then 
we  can  bound  the  probability  that  v’s  packet  is  lost  by  the  probability  that  some  n/2 
of  its  pieces  run  into  faulty  components.  But  if  that  many  pieces  are  lost,  then  at 
least  n/4  are  lost  during  one  of  the  two  phases  of  the  routing  algorithm.  Assume  they 
are  lost  in  the  first  phase;  the  reasoning  for  phase  2  is  identical.  There  are  at  most 
(2n  +  3)n  different  components  (nodes  or  links)  encountered  by  pieces  from  v  during 
the  first  phase.  We  need  the  following  bound  on  the  number  of  intersections  between 
the  routes  of  different  pieces. 

Lemma 2.12. For any hypercube node $u$, no more than two paths in $\Pi(v, w)$
cross $u$.


Proof. Count the nodes along the path $\pi_i(v, w)$ starting with $v^i$ as the 0-th node.
Say that the $k$-th node along $\pi_i(v,w)$ is the same as the $l$-th node along $\pi_j(v,w)$ for
$i < j$. Then this node is $w_1 w_2 \ldots w_k v_{k+1} \ldots v_n$ with its $i$-th bit complemented,
and similarly it is $w_1 w_2 \ldots w_l v_{l+1} \ldots v_n$ with its $j$-th bit complemented.

There are four cases. If $k, l < j$ then $\bar{v}_j = v_j$, a contradiction. Similarly, if $k, l \ge i$
then $\bar{w}_i = w_i$, a contradiction. If $k < i, l \ge j$ or if $l < i, k \ge j$ then it must be true
that $w_i = \bar{v}_i$, $w_j = \bar{v}_j$ and $w_h = v_h$ for $i < h < j$. Thus all $\pi_h(v,w)$ with $i < h < j$
are precluded from crossing $u$ (otherwise $\bar{w}_h = v_h$, a contradiction). Therefore three
paths cannot all cross $u$. ■

Since  no  component’s  failure  will  affect  more  than  two  pieces,  it  must  be  true  that 
at  least  n/8  of  the  (2n  +3)n  components  have  failed.  Only  in  a  small  fraction  of  fault 
patterns  will  so  many  failures  occur  in  such  a  small  set  of  components. 

Theorem 2.13. For any constant $k > 0$ there is a sufficiently large constant $b > 0$
such that if each component of the hypercube fails independently with probability
$1/(bn)$ before or during some permutation routing, then with probability $1 - N^{-k}$ the
routing will be successfully completed. That is, a given packet will arrive at its
destination iff both its origin and destination do not fail.

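Proof. Whether or not the $i$-th component fails gives rise to a 0-1 random variable
whose moment generating function is $M_i(\lambda) = \frac{1}{bn} e^{\lambda} + (1 - \frac{1}{bn})$. The moment
generating function for the sum of these random variables is

$$M(\lambda) = \Big(1 + \frac{e^{\lambda} - 1}{bn}\Big)^{(2n+3)n} \le \exp\Big(\frac{(2n+3)(e^{\lambda} - 1)}{b}\Big).$$

Thus we can bound the probability that more than $n/8$ of the components fail by
$\exp\big(\frac{(2n+3)(e^{\lambda} - 1)}{b} - \frac{\lambda n}{8}\big)$. Setting $\lambda = \ln(b/16)$, we see that for sufficiently large $n$
the probability of so many failures is no more than $(e^{1/8}(16/b)^{1/8})^n$. The exponent of
this inverse polynomial in $N$ can be made as low as desired by increasing the constant $b$. ■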
2.4 Remarks


Offset routing and information dispersal are complementary techniques. By
combining this simplified variant of information dispersal with offset routing, still better
results are possible. The combined routing algorithm tolerates the failure of a
constant fraction of the hypercube's components during the course of the routing of a
single permutation. To send a packet, the node first disperses pieces to a well defined
set of $n$ nodes at distance three (instead of neighbors). The pieces are then routed
along parallel offset paths to the symmetric set of $n$ nodes close to the destination.
Finally, the pieces are combined at the destination. If each node or link fails
independently of other components, and if, when it fails, it does so at a random time
during the routing, then this combined algorithm tolerates the failure of a constant
fraction of the hypercube's components.
Chapter 3


Reconfiguration  in  the 
Presence  of  Faults 


3.1  Introduction 

In  this  chapter,  we  continue  our  investigation  of  the  tolerance  of  the  hypercube  to 
randomly  distributed  faults.  The  techniques  we  develop  assume  a  long-run  view. 
Given that faults have accumulated in a hypercube over time, each component
independently faulty with probability $p$, we would like to be able to program the machine
while  ignoring  whatever  faults  exist.  We  show  how  to  use  the  functioning  parts  of  a 
hypercube  with  faults  to  simulate  a  hypercube  without  faults  at  a  surprisingly  low 
cost.  More  precisely,  we  show  how  to  embed  a  hypercube  in  the  functioning  part  of 
a  hypercube  with  faults  so  that  features  such  as  locality  are  preserved. 

Before we can state our results formally and assess their value, we first must
describe the constraints, assumptions and objectives of network reconfiguration and/or
simulation in the presence of faults. We divide the discussion into six general topic
areas: preservation of locality, load balancing, message traffic, simulation overhead,
algorithms for implementation, and modelling of faults.

An embedding of a network $G_1$ into another network $G_2$ is a map $\phi : G_1 \to G_2$
that maps each node of $G_1$ to a node of $G_2$ and each edge of $G_1$ to a path in $G_2$
between the images of its endpoints.

We call the pattern of faults $F$. That is, we include each component of $H_n$ in $F$
independently and with probability $p$. Therefore the functional part of the hypercube
is $H_n - F$. An embedding of $H_n$ into $H_n - F$ is a map $\phi : H_n \to H_n - F$ that maps
nodes of $H_n$ to functioning nodes of $H_n$ and edges of $H_n$ to functioning paths of $H_n$.
The precise definition of functioning nodes and paths will vary, although the general
interpretation is straightforward.

Preservation of locality. Because communication in hypercube-based machines
is mostly local, and because communication is a dominant factor in measuring the
performance of a parallel machine, it is crucial that a good embedding of $H_n$ in
$H_n - F$ preserve locality. In other words, neighboring processors in $H_n$ should be
mapped to nearby processors in $H_n - F$. In order to quantify this notion, we say
that an embedding has dilation $D$ if for each edge $e$ in $H_n$, the path $\phi(e)$ has length
at most $D$ in $H_n - F$. Of course, it is most desirable to find embeddings with small
dilation. At the very least, the dilation of an embedding $\phi$ is a lower bound on the
time required for $H_n - F$ to simulate a single step of $H_n$ if the computation of each
node $v \in H_n$ is performed by $\phi(v)$ in $H_n - F$.

The notion of dilation can also be extended to paths. We will describe natural
embeddings of $H_{n-1}$ in $H_n - F$ for which nodes separated by distance $d$ in $H_{n-1}$ are
mapped to nodes separated by distance $d + 2$ in $H_n - F$. These embeddings have
dilation 3.

Balancing the load. We will consider embeddings which allow several nodes of
$H_n$ to be mapped to a single node of $H_n - F$. Mappings that are one-to-one are the
most desirable since then each processor of $H_n - F$ only has to simulate the action
of a single processor of $H_n$. In general, we define the max load of an embedding to
be the maximum number of processors of $H_n$ mapped to any single node of $H_n - F$.
One algorithm we describe discovers embeddings with max load 1 (i.e. one-to-one
mappings) while the other finds embeddings with constant max load.

In addition to having small max load, it is desirable to use as many of the
functioning cells of $H_n - F$ as possible. The use of live cells is partly described by the
max load. To further characterize this quantity, we define the expansion of an
embedding to be the ratio of the size of the largest one-to-one hypercube we could hope
to embed in $H_n - F$ to the size of the hypercube that we do embed. Since the size of
a  hypercube  is  always  a  power  of  two,  the  expansion  is 


We focus on embeddings of $H_{n-1}$ in $H_n - F$ for $p < 1/2$. Such embeddings have
expansion one, which is the best possible.

Message traffic. In addition to balancing the processing load among the
functioning processors, it is desirable to balance the message routing load among the wires.
In particular, it would not be good if many paths $\{\phi(e) \mid e \in H_m\}$ traversed a single
wire of $H_n - F$ since local communication along these paths would require the use
of the same wire. To formalize this notion we say that an embedding has congestion
$C$ if every edge of $H_n - F$ is contained in at most $C$ paths of $\{\phi(e) \mid e \in H_m\}$. We
consider embeddings with congestion as much as $\Theta(\log N)$ and as little as $O(1)$.
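
The three measures defined so far (dilation, max load and congestion) are easy to compute for an explicitly given embedding. The sketch below is illustrative only: the representation (a dictionary from guest nodes to host nodes, and a dictionary from guest edges to host paths given as node lists) and the toy example, a 2-cube embedded in a 3-cube whose node 001 is faulty, are assumptions made for the example.

```python
from collections import Counter

def dilation(edge_paths):
    """Length, in edges, of the longest image path."""
    return max(len(path) - 1 for path in edge_paths.values())

def max_load(node_map):
    """Largest number of guest nodes mapped to a single host node."""
    return max(Counter(node_map.values()).values())

def congestion(edge_paths):
    """Largest number of image paths sharing one undirected host edge."""
    use = Counter()
    for path in edge_paths.values():
        for a, b in zip(path, path[1:]):
            use[frozenset((a, b))] += 1
    return max(use.values(), default=0)

# Toy example: guest H_2 into host H_3 with host node 0b001 faulty.
node_map = {0b00: 0b000, 0b01: 0b011, 0b10: 0b100, 0b11: 0b111}
edge_paths = {
    (0b00, 0b01): [0b000, 0b010, 0b011],  # detour around the faulty node
    (0b10, 0b11): [0b100, 0b110, 0b111],
    (0b00, 0b10): [0b000, 0b100],
    (0b01, 0b11): [0b011, 0b111],
}

if __name__ == "__main__":
    print(dilation(edge_paths), max_load(node_map), congestion(edge_paths))
```

In this toy embedding no host edge is shared and no host node is reused, so the only cost of avoiding the fault is the longer image paths.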

Congestion is a lower bound on the time required for the functioning part of $H_n$
to simulate $H_m$ if messages traversing $e$ in $H_m$ are routed along $\phi(e)$ in $H_n - F$. For
some specific applications, however, we can do better. For example, hypercubes are
often used to simulate bounded-degree networks such as arrays and trees. In such
applications, only a constant number of wires incident to any node are used in any
parallel step of $H_m$. Hence, the effective congestion in the corresponding embedding
may be much less than it seems at first. To capture this notion, we say that an
embedding has induced congestion $I$ if every edge of $H_n - F$ is contained in at most $I$
paths of $\{\phi(e) \mid e \in H_m\}$ for which the edges $e \in H_m$ are node-disjoint. The two main
algorithms in this chapter find embeddings with constant induced congestion. Such
embeddings are particularly useful for simulating trees, arrays, normal hypercube
algorithms and other structures with bounded processor degrees.

Simulation overhead. One obvious use of a hypercube with faults is to simulate
a hypercube without faults. This can always be done given enough slowdown and
duplication of resources, but the goal is to make the simulation as efficient as possible.
The key factors influencing the efficiency of the simulation are dilation, max load and
congestion. By achieving good bounds on these values, we show that any step of
a hypercube $H_n$ can be simulated by $O(1)$ steps of $H_n - F$. In addition, we use
the notion of induced congestion to show that a hypercube with faults can simulate
trees, arrays and other bounded-degree networks of the same size with only constant
slowdown.

Algorithms for implementation. In addition to proving that there is an
efficient embedding of $H_m$ in $H_n - F$, it is desirable to develop an efficient algorithm
for finding the embedding. Ideally, the algorithm would be deterministic, fast, easy
to implement, and decentralized (i.e., using only local control). In fact, we describe
such algorithms in sections 3.2 and 3.3. We also describe a fast, local probabilistic
algorithm in section 3.5.

Modelling  of  faults.  In  general,  we  might  consider  three  types  of  faults  in 
Hn-  The  most  serious  fault  would  be  one  that  completely  destroys  a  node  and  all 
wires  incident  to  it.  We  call  such  faults  total.  A  less  serious  fault  would  be  one  that 
destroys  just  the  computational  portion  of  a  node,  and  leaves  the  communication  (i.e. 
switching  or  routing)  portion  of  the  node  intact  as  well  as  the  incident  wires.  We  call 
such faults partial. (Note that it does not make sense to consider a fault that destroys
just  the  communication  portion  of  the  node.  The  computation  portion  would  then 
also  be  useless  since  it  would  be  disconnected  from  the  rest  of  the  network.)  Last, 
faults  could  occur  in  individual  wires. 

In  our  model,  no  malicious  faults  occur.  Any  node  can  determine  if  a  neighboring 
node or link has failed by probing the link in $O(1)$ time.

Along  with  the  type  of  fault,  the  distribution  of  faults  must  also  be  specified.  As 
with  routing  in  chapter  two,  we  consider  a  model  in  which  faults  occur  independently 
among  components  with  probability  p.  We  restrict  our  attention  to  the  situation 
where p < 1/2, although the methods can be extended for larger p. In addition, we
consider  the  case  in  which  the  number  of  faults  is  smaller  than  a  constant  fraction 
of  the  total  number  of  nodes.  The  assumption  concerning  independence  of  faults  is 
crucial  to  our  analysis,  but  the  methods  can  also  be  applied  in  a  hierarchical  setting 
where entire subcubes of nodes may fail at once. Such extensions might be useful in
a practical setting where the actual machine may consist of a collection of boards,
each of which consists of a collection of chips, and so on.
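
The independent-fault model described above can be sketched in a few lines. This is only an illustrative sketch: the function name `sample_faults` and the encoding of nodes as integers 0 .. 2^n - 1 are assumptions made here, not part of the thesis.

```python
import random

def sample_faults(n, p, rng):
    """Return the set F of faulty nodes of the n-dimensional hypercube H_n.

    Nodes are the integers 0 .. 2**n - 1 (an assumed encoding); each node is
    placed in F independently with probability p.
    """
    return {v for v in range(2 ** n) if rng.random() < p}

if __name__ == "__main__":
    rng = random.Random(0)              # fixed seed, only for reproducibility
    faults = sample_faults(10, 0.25, rng)
    print(len(faults) / 2 ** 10)        # empirical fault fraction
```

A hierarchical variant, in which whole subcubes fail together, would replace the per-node coin flip by one coin flip per subcube.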

The material we present is philosophically related to previous work in fault tolerance
of arrays in the context of wafer-scale integration ([Gr],[GG],[LL1],[LL2]), although
the techniques and results are quite different. For example, constant dilation
reconfiguration is not possible for arrays and trees. There has been relatively little
previous work on the fault-tolerant reconfiguration of hypercubes (to our knowledge).
An exception is the work of Becker and Simon ([BS]), who consider fault-free subcubes
of a hypercube containing worst case faults. The constraint that the embedded
cube be a subcube (i.e., dilation 1) is very restrictive, as is the assumption that faults
are located in a worst case fashion. Hence, the techniques and results of [BS] are quite
different from those presented here. Another exception is the work of Dolev, Halpern,
Simons and Strong ([DHSS]) who also study worst case bounds. Their model of
communication also differs from ours in that they assume that after the faults occur, the
new connections must be chosen from a predetermined set of routings.

3.1.1  Summary  of  Results 

At first, we consider an N-node hypercube containing random partial processor faults.
We describe algorithm 3.1, an algorithm for embedding an N/2-node hypercube in
the functioning processors.

Theorem 3.3. Algorithm 3.1 is a local, deterministic O(log N) step algorithm. If
the nodes of H_n fail independently and partially with probability p < 1/4 then with
probability at least 15/16 algorithm 3.1 constructs a one-to-one embedding of H_{n-1}
into H_n - F with dilation 3, congestion 2 log N, and induced congestion 2.

Next we improve this algorithm so that it embeds an N/2-node hypercube in
the functioning processors with the same performance with probability 1 - N^{-c}
for a constant c > 0, provided that processors are faulty with probability p < 1/2,
for sufficiently large N. The algorithm for finding the embedding is deterministic,
easy to implement, runs in O(log N) parallel steps, and uses only local control. As a
result, we extend the results of Bhatt, Chung, Leighton and Rosenberg [BCLR] and
others to be fault-tolerant. In particular, we show that a hypercube with partial
processor faults can simulate any binary tree or mesh of the same size with only
constant factor slowdown. The most surprising (and potentially most useful) feature
of the embedding is the degree to which it preserves locality.

Next, we extend the results to handle total faults. We describe an embedding for
which the dilation is 7, the max load is 1, the congestion is O(log N), and the induced
congestion is O(log N/log log N) with high probability (for sufficiently large N). The
algorithm for achieving these results is probabilistic, runs in O(log N) steps, and uses
only local control.

Finally, we address the issue of congestion. We demonstrate a probabilistic algorithm
which with high probability finds an embedding in a hypercube containing
totally faulty processors for which the dilation is 5, the max load is O(1), and the
congestion is O(1).

Theorem 3.22. For each p below a fixed threshold (about .16) there is an O(log N) step
algorithm such that if each of the nodes of an N-node hypercube fails with probability p
then with high probability the algorithm finds an embedded fully functioning N-node
cube with constant load, dilation and congestion. The paths which simulate the edges
of the cube only use live nodes.

As  a  consequence,  a  faulty  hypercube  can  simulate  a  functioning  hypercube  of  the 
same  size  with  constant  delay. 

These last two algorithms actually work in a semi-worst case setting. As long as
a constant fraction of each node's neighbors remain alive and a constant fraction of a
specified set of paths for each node have no faults along them, the good embeddings
exist.

Chapter three is the result of joint work with Johan Håstad and Tom Leighton.

3.1.2 Overview


In  section  3.2,  we  consider  only  partial  faults  which  occur  with  probability  p  <  1/4. 
We  extend  the  algorithm  to  handle  failure  probabilities  up  to  1/2  in  section  3.3.  A 
probabilistic  algorithm  for  reconfiguring  with  total  faults  appears  in  section  3.4  and 
section  3.5  contains  a  probabilistic  algorithm  achieving  constant  delay  reconfiguration 
with total faults. In section 3.6 we describe a way to implement the algorithms of
section 3.5 so that they run in O(log* N log log N) time. In section 3.7 we extend our
results  to  the  cases  where  the  probability  of  failure  is  very  low  and  also  to  the  case 
where  edge  faults  occur. 


3.2  Embeddings  for  Small  p  with  Dilation  3 

In this section we consider the less severe model of partial faults where it is possible to
use the faulty processors as switches and to route through them. We assume that the
probability that any given processor fails is less than or equal to 1/4 and we present an
algorithm which with probability 15/16 constructs a one-to-one embedding of H_{n-1}
in H_n with dilation 3, congestion 2n (= 2 log N) and induced congestion 2.

3.2.1  Mapping  Dead  Nodes  to  Live  Nodes 

Let H_{n-1} be the subhypercube on N/2 = 2^{n-1} nodes induced by the nodes with first
coordinate zero. For each node v in H_n let v' be the node with first coordinate
different from v's whose coordinates otherwise agree with those of v (i.e., v's neighbor
across the first dimension). Also, for a node y = (y_1, y_2, ..., y_{n-1}) in H_{n-1}, let \bar{y} be
the node in H_n with coordinates (0, y_1, y_2, ..., y_{n-1}).

Given some pattern of failure for the nodes in H_n, say a node v ∈ H_{n-1} is rich
if both v and v' are live and poor if both v and v' are dead. If every node in H_{n-1}
were not poor, we could easily embed H_{n-1} in H_n - F by mapping (y_1, ..., y_{n-1}) to
whichever of {(0, y_1, ..., y_{n-1}), (1, y_1, ..., y_{n-1})} were alive. Unfortunately, there will
be a constant fraction of poor nodes in H_{n-1} with very high probability since each
node in H_{n-1} is poor with probability p^2.

Figure 3-1: Borrowing from a neighbor. Dead nodes are shown as white. The arrow
points from the simulated node to the simulating node.

We handle the existence of poor nodes by mapping each poor node v to a neighboring
rich node w. At most one v is mapped to a given node w. Hence v can borrow
w'. In this fashion we will be able to embed H_{n-1} in H_n - F much as if there were
no poor nodes at all. At step k algorithm 3.1, shown in figure 3-2, will attempt to
assign v to v^k, the neighbor of v across dimension k, if v and v^k are as yet unassigned.

for k ← 2 to n
    for all nodes v
        if v is poor and unassigned and v^k is rich and unassigned
            assign v to v^k


Figure  3-2:  Algorithm  3.1. 
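
As a rough illustration of algorithm 3.1, the following sketch runs the borrowing steps sequentially on a sampled fault pattern. The integer encoding of nodes (bit k-1 of a label is coordinate k) and the names `algorithm_3_1`, `rich`, `poor` and `lent` are assumptions made for this sketch. Within one step the pairs (v, v^k) form a matching, so processing the nodes one after another gives the same result as processing them in parallel.

```python
import random

def algorithm_3_1(n, faulty):
    """Sketch of the borrowing step: map each poor node to a rich neighbor.

    faulty is the set of dead nodes of H_n, labelled 0 .. 2**n - 1, where
    bit k-1 of a label is coordinate k (an assumed encoding).
    Returns (assign, unassigned): assign[v] is the rich node that the poor
    node v borrows from; unassigned lists the poor nodes left without one.
    """
    bottom = range(0, 2 ** n, 2)   # the copy of H_{n-1}: first coordinate 0
    rich = {v for v in bottom if v not in faulty and (v ^ 1) not in faulty}
    poor = {v for v in bottom if v in faulty and (v ^ 1) in faulty}
    assign, lent = {}, set()
    for k in range(2, n + 1):                  # steps k = 2, ..., n
        bit = 1 << (k - 1)
        for v in sorted(poor):
            w = v ^ bit                        # v^k, neighbor across dimension k
            if v not in assign and w in rich and w not in lent:
                assign[v] = w
                lent.add(w)
    return assign, sorted(poor - set(assign))

if __name__ == "__main__":
    rng = random.Random(1)                     # fixed seed for reproducibility
    n, p = 10, 0.25
    faulty = {v for v in range(2 ** n) if rng.random() < p}
    assign, unassigned = algorithm_3_1(n, faulty)
    print(len(assign), "poor nodes assigned;", len(unassigned), "left over")
```

The embedding exists exactly when the list of unassigned poor nodes comes back empty.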


3.2.2  Analysis  of  the  Borrowing 

* We use the phrase "Q is more than Ω(g) with very high probability" to mean "There exist
constants k and d independent of N such that the probability that Q does not exceed dg is
less than N^{-k}."

If processors fail with probability less than or equal to 1/4, algorithm 3.1 will construct
an embedding with probability at least 15/16. We show this by proving two simple
lemmas. First we will prove a lemma which will be of crucial importance to the later
analysis.

Lemma 3.1. At step k - 1, what has happened to v is independent of what has
happened to any node which differs from v in at least one of the last n - k coordinates.

Proof. At step i, nodes that affect each other have all coordinates identical except
the ith coordinate. Thus at step k - 1, if we divide the nodes into groups each having
identical last n - k coordinates, all previous communication has taken place within
the groups. Thus nodes in different groups cannot affect each other. ■

Lemma 3.2. The probability that a given node v is poor and unassigned after the
ith step is at most (2p)^i p^2.

Proof. For each node v let p_i = Pr[v is poor and unassigned after step i] and q_i =
Pr[v is rich and unassigned after step i]. Then p_0 = p^2 and q_0 = (1 - p)^2. A node v
will be poor and unassigned after step i + 1 if and only if it was poor and unassigned
after step i and the node it requested in step i + 1 was not rich and unassigned. A
similar statement holds for whether a node is rich and unassigned after step i + 1.
Thus, since these probabilities are independent,

    q_{i+1} = q_i (1 - p_i)
    p_{i+1} = p_i (1 - q_i)

Subtracting the two equations yields q_{i+1} - p_{i+1} = q_i - p_i, which is natural since the
surplus of rich nodes over poor nodes is constant. Thus the difference is q_i - p_i =
q_0 - p_0 = 1 - 2p, or q_i = 1 - 2p + p_i. Therefore p_{i+1} = p_i(2p - p_i) ≤ (2p)p_i and so
p_i ≤ (2p)^i p_0 = (2p)^i p^2. ■
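
The recurrence in the proof of lemma 3.2 is easy to check numerically. The sketch below iterates it with exact rational arithmetic and asserts both the invariant q_i - p_i = 1 - 2p and the bound p_i ≤ (2p)^i p^2; the function name `borrow_probabilities` is an assumption made for this illustration.

```python
from fractions import Fraction

def borrow_probabilities(p, steps):
    """Iterate p_{i+1} = p_i(1 - q_i), q_{i+1} = q_i(1 - p_i)
    starting from p_0 = p^2 and q_0 = (1 - p)^2, using exact fractions."""
    p = Fraction(p)
    pi, qi = p * p, (1 - p) ** 2
    history = [(pi, qi)]
    for _ in range(steps):
        pi, qi = pi * (1 - qi), qi * (1 - pi)   # simultaneous update
        history.append((pi, qi))
    return history

if __name__ == "__main__":
    p = Fraction(1, 4)
    for i, (pi, qi) in enumerate(borrow_probabilities(p, 8)):
        assert qi - pi == 1 - 2 * p             # surplus of rich over poor is invariant
        assert pi <= (2 * p) ** i * p ** 2      # bound of lemma 3.2
```

Exact fractions are used so that the invariant holds with equality rather than up to rounding error.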

The probability that an individual node is poor and unassigned at the end of the
algorithm is less than (1/2)^{n-1}(1/16) = 1/(8N). Thus the probability that some node
is poor and unassigned is no more than (N/2)(1/(8N)) = 1/16.
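
Spelled out, the arithmetic behind these two figures is as follows, assuming (as in the text) that p ≤ 1/4, that N = 2^n, and that the algorithm performs the n - 1 steps k = 2, ..., n:

```latex
\begin{align*}
\Pr[\text{$v$ is poor and unassigned at the end}]
  &\le (2p)^{n-1} p^2
   \le \left(\tfrac{1}{2}\right)^{n-1} \cdot \tfrac{1}{16}
   = \frac{2}{N} \cdot \frac{1}{16}
   = \frac{1}{8N}, \\
\Pr[\text{some node of $H_{n-1}$ is poor and unassigned}]
  &\le \frac{N}{2} \cdot \frac{1}{8N}
   = \frac{1}{16}.
\end{align*}
```

The second line is a union bound over the N/2 nodes of H_{n-1}.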
Figure 3-3: Mapping an edge to a path. The heavy edges form the path chosen to
simulate the edge between the two simulated nodes at bottom.

3.2.3  Embedding  Edges 

If the algorithm successfully assigns each poor node to a rich node call the assignment
ψ. Embed H_{n-1} in H_n with the embedding φ which maps nodes in H_{n-1} to nodes in
H_n by

    φ(y) = \bar{y}          if \bar{y} ∈ H_n - F
    φ(y) = \bar{y}'         if \bar{y}' ∈ H_n - F, \bar{y} ∈ F
    φ(y) = ψ(\bar{y})'      otherwise

and maps edges in H_{n-1} to paths in H_n by

    φ((y, z)) = (φ(y), φ(z))                        if φ(y) = \bar{y}, φ(z) = \bar{z}
    φ((y, z)) = (φ(y), φ(y)', \bar{y}, φ(z))        if φ(y) ≠ \bar{y}, φ(z) = \bar{z}
    φ((y, z)) = (φ(y), \bar{y}', \bar{z}', φ(z))    if φ(y) ≠ \bar{y}, φ(z) ≠ \bar{z}

Although  the  mapping  of  the  edges  looks  complicated,  every  edge  simply  maps  to  a 
shortest  path  between  the  corresponding  nodes.  Figure  3-3  depicts  an  instance  of  the 
third  possibility.  Since  in  all  cases  the  length  of  the  path  is  at  most  3  the  embedding 
has  dilation  3. 

To check the congestion, observe that a given edge is used on a given path φ((y, z))
only when one of its endpoints is φ(y), \bar{y}, or \bar{z}'. Checking cases shows that no edge
could lie on three paths corresponding to node-disjoint edges. Thus, since we can
partition the edges of the hypercube into n matchings, the congestion can be no
worse than 2n. By this argument it also follows that the induced congestion of the
embedding is at most 2.
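
The partition into matchings used in this argument is the natural one by dimension. A minimal sketch, with the function name `dimension_matchings` and the integer node encoding assumed for illustration:

```python
def dimension_matchings(n):
    """Partition the edges of H_n into n perfect matchings, one per dimension.

    Matching k contains the edges (v, v with bit k flipped) for every node v
    whose bit k is 0, so each node appears in exactly one edge of each matching.
    """
    return [[(v, v ^ (1 << k)) for v in range(2 ** n) if not v & (1 << k)]
            for k in range(n)]
```

Since at most two paths from any one matching can share an edge, summing over the matchings gives the 2n bound.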

Theorem 3.3. Algorithm 3.1 is a local, deterministic O(log N) step algorithm. If
the nodes of H_n fail independently and partially with probability p < 1/4 then with
probability at least 15/16 algorithm 3.1 constructs a one-to-one embedding of H_{n-1}
into H_n - F with dilation 3, congestion 2 log N, and induced congestion 2.


3.3  Embeddings  with  Dilation  3  for  p  <  1/2 

In this section we extend the algorithm of section 3.2 so that it can handle independent
faults with probabilities exceeding 1/4 but less than 1/2. This is best possible in the
sense that if p > 1/2, then more than half of the nodes will fail with probability at
least 1/2. In that case it would be impossible to achieve a one-to-one embedding of
H_{n-1} in H_n - F.

Call a node v ∈ H_{n-1} a topnode if v is dead but v' is alive. We now handle the
existence of poor nodes by mapping each poor node v to a neighboring node w which
is either rich or a topnode. If w is a topnode, we make sure that w has a rich neighbor
u so that w can borrow u'. We call this process pushing a topnode.
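
The node types used from here on can be summarized by a small classifier. This is a sketch under assumed conventions: nodes are integers with bit 0 as the first coordinate, and the label "plain" for a live node whose partner v' is dead is introduced here only for completeness, since the text gives that case no name.

```python
def classify(v, faulty):
    """Classify a node v of H_{n-1} (an even label, first coordinate 0)."""
    dead, partner_dead = v in faulty, (v ^ 1) in faulty
    if not dead and not partner_dead:
        return "rich"          # v and v' both alive
    if dead and partner_dead:
        return "poor"          # v and v' both dead
    if dead:
        return "topnode"       # v dead, v' alive
    return "plain"             # v alive, v' dead (label assumed here)
```

Under independent faults with probability p, a node is rich with probability (1 - p)^2, poor with probability p^2, and a topnode with probability p(1 - p).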

Algorithm  3.2,  shown  in  figure  3-5,  will  carry  out  the  program  outlined  above  in 
4  stages.  The  only  additional  feature  is  that  poor  nodes  without  enough  topnode 
neighbors  will  be  treated  separately. 

Observe  that  conflicts  can  only  occur  during  stage  4,  and  can  be  easily  resolved 
by  having  the  node  with  lower  index  win. 

Lemma 3.4. Assume that nodes fail independently with probability p < 1/2 and
that N is sufficiently large. Then there is a constant c_8 > 0 such that after algorithm
3.2 terminates, with probability 1 - N^{-c_8} (i) every poor node is assigned to a neighbor
which is either a topnode or a rich node, and (ii) every topnode which has been
assigned to a poor node is pushed to a rich neighbor.

Figure 3-4: Pushing a topnode. Dead nodes are shown as white. The arrows point
from the simulated nodes to the simulating nodes.

3.3.1  Analyzing  Stages  1  and  2 

Stages 1 and 2 comprise a "first pass" to assign poor nodes. In stage 1, those poor
nodes with few topnode neighbors are given first crack at assignments, since these
nodes will have much less ability to push neighboring topnodes later in stage 4.
Stage 2 replicates algorithm 3.1. We expect that the vast majority of nodes will find
assignments during this stage.

Let ε be a small positive constant depending on p and let d = d(ε, p) and c = c(ε, p)
be suitable positive constants depending only on ε and p. Throughout the argument
we  will  assume  that  N  is  sufficiently  large.  The  neighborhood  of  a  point  is  the  set  of 
points  at  distance  1  from  the  point,  and  a  sphere  denotes  a  sphere  in  the  Hamming 
metric. 

Lemma 3.5. For ε < p(1 - p)/(4√2) there is a positive integer constant d = d(ε, p)
such that for sufficiently large N, with probability 1 - 1/N no sphere of radius 6
contains more than d nodes processed in stage 1.

Proof. Take any sphere of radius 6 and any d points in this sphere. The union
of their neighborhoods is of size at least dn - d^2 ≥ dn/2. (N = 2^n is assumed to
be sufficiently large and any two neighborhoods do not have more than 2 points in
common.) By assumption, at most 2dεn nodes in this neighbor set can be topnodes.
Since the probability that an individual node is a topnode is p(1 - p) the probability
of having exactly i topnodes in a set of size dn/2 is

    P_i = \binom{dn/2}{i} (p(1 - p))^i (1 - p(1 - p))^{dn/2 - i}.

Observe that P_{i+1}/P_i ≥ √2 for i < p(1 - p)dn/(2√2). Using this and the fact that any
P_i is less than 1 we get

    \sum_{i ≤ 2dεn} P_i ≤ 2^{-Ω(dn)},

where the constant hidden in the Ω depends only on ε and p.
The probability that fewer than 2dεn nodes in a set are topnodes decreases as we
make the set larger than dn/2. Thus this bound holds no matter what the actual size
of the union of the neighborhoods may be. Since there are at most N possible spheres
and at most \binom{n^6}{d} ways of choosing d points, the lemma follows for large enough d. ■

Stage 1:
    for every poor node v which has fewer than εn topnodes as neighbors
            across dimensions > εn do
        for k ← 2 to n
            if v^k is rich and unmarked, mark it v

Stage 2:
    for k ← 2 to εn
        for all nodes v
            if v is poor and unassigned and v^k is rich and unassigned
                assign v to v^k

Stage 3:
    for every node v which was processed in stage 1
        assign v to the node which was marked v
    for every node w assigned to a marked node during stage 2
        w becomes unassigned

Stage 4:
    for all unassigned poor nodes v do
        for k ← εn + 1 to n
            if v^k is an unpushed topnode and there is an unassigned rich node v^{kj}
                    for some j > ε log N
                assign v to v^k and push v^k to v^{kj}


Figure 3-5: Algorithm 3.2.

Next  we  show  that  stage  1  has  a  good  probability  of  success  for  the  nodes  to  which 
it  is  applied. 

Lemma 3.6. The probability that there exists a node which has fewer than 2εn
neighbors across dimensions greater than εn which are either rich or topnodes is
bounded by N^{-c_9} for sufficiently small ε > 0 and c_9 > 0.

Proof. This proof uses reasoning similar to that of the proof of lemma 2.3. Choose
ε' > 0 such that (1 - ε') log p = -1 - c' for some c' > 0. The probability of having fewer
than δ log N neighbors across dimensions greater than ε' log N which are either rich
or topnodes is

    \sum_{i < δ log N} \binom{(1 - ε') log N}{i} (1 - p)^i p^{(1 - ε') log N - i}.

We can compare consecutive terms to show that this sum is bounded by a constant
times the last term. The last term is at most

    exp_2(δ log(e(1 - ε')) log N - δ log δ log N
          + δ log N log(1 - p) - δ log N log p + (1 - ε') log N log p)

    = exp_2(h(δ, p) log N + (1 - ε') log N log p).

For δ = 0, h(δ, p) = 0 and the above expression is N^{-1-c'}. Since h(δ, p) is
continuous, there is a δ > 0 such that the above expression is bounded by N^{-1-c_9}
with c_9 > 0. Finally set ε = min{ε', δ/2} and observe that decreasing ε' to ε can
only decrease the probability of having at most δ log N topnode neighbors across
dimensions > ε' log N. ■

By lemma 3.6, with probability 1 - N^{-c_9} any node processed in stage 1 must have
at least εn rich neighbors. The only way it could fail to mark one of these is if they
were all marked by other nodes. This is impossible since by lemma 3.5 there is only
a constant number of nodes processed during stage 1 within distance 6. Thus each
node participating in stage 1 successfully marks a rich node.

Let us next analyze stage 2. Note that stage 2 is independent of stage 1. By lemma
3.2, after stage 2 the probability that an individual node is poor and unassigned is
< (2p)^{εn-1} p^2 while the probability that it is rich and unassigned is > (1 - 2p).

Let a ∈ {0,1}^{εn} and let H_a be the (1 - ε)n dimensional hypercube consisting of the
nodes whose ith coordinate is a_i, i = 1, ..., εn. Observe that by lemma 3.1 the statuses
of nodes in an individual hypercube are independent during stage 2.

Lemma 3.7. There is a constant d = d(p) such that the probability that there is a
sphere of radius 4 in any H_a which contains more than d unassigned poor nodes after
stage 2 is less than 1/N.

Proof. There are N ways to choose a sphere over all H_a's and at most \binom{n^4}{d} ways of
choosing d points in each sphere. The probability that these d nodes are poor and
unassigned is at most ((2p)^{εn-1} p^2)^d and the lemma follows for sufficiently large d. ■


3.3.2  Analyzing  Stages  3  and  4 

Stages  3  and  4  are  responsible  for  assigning  those  nodes  which  remain  unassigned  after 
stage  2.  In  stage  3,  nodes  assigned  in  stage  1  negate  some  of  stage  2’s  assignments. 
Bumped  nodes  find  new  assignments  in  stage  4. 

Lemma 3.8. With high probability, after stage 3 there are only 2d unassigned poor
nodes in any sphere of radius 4 in any H_a.

Proof. This follows from lemmas 3.5 and 3.7. The only additional unassigned poor
nodes come from the nodes whose assignments are stolen in stage 3. But since the
thief is at distance two, lemma 3.5 bounds the number of poor nodes subject to theft
in any sphere of radius 4. ■

To prove lemma 3.4 observe finally that stage 4 only works inside an individual
H_a. Fix a poor unassigned node at the beginning of stage 4. It has εn topnode
neighbors. The probability that an individual topnode neighbor does not have εn
rich unassigned neighbors is small, so with high probability each unassigned
poor node has many topnode neighbors with εn unassigned rich neighbors each. Stage 3
reduces the number of rich neighbors to each topnode only by a constant. By lemma
3.8 we know that during stage 4 only 2d other unassigned poor nodes can interfere.
Therefore stage 4 is successful and lemma 3.4 follows. ■


If the algorithm successfully assigns each poor node to a rich node or topnode and
each pushed topnode to a rich node, call the assignment ψ. Embed H_{n-1} in H_n - F
with the embedding φ which maps nodes in H_{n-1} to nodes in H_n - F by

    φ(y) = \bar{y}          if \bar{y} ∈ H_n - F
    φ(y) = \bar{y}'         if \bar{y}' ∈ H_n - F, \bar{y} ∈ F and \bar{y} is not pushed
    φ(y) = ψ(\bar{y})'      otherwise

and maps edges in H_{n-1} to paths in H_n - F precisely as discussed in section 3.2.


Theorem 3.9. Algorithm 3.2 is a local, deterministic algorithm. For any p < 1/2 there
is a sufficiently small constant c_11 > 0 such that if the nodes of H_n fail independently
and partially with probability p, for sufficiently large N the following is true with
probability 1 - N^{-c_11}. Algorithm 3.2 takes O(log N) steps and constructs a one-to-one
embedding of H_{n-1} into H_n - F with dilation 3 and congestion 2 log N. The
embedding has the property that if a constant degree network C is embedded in H_{n-1}
then the induced embedding in H_n - F has constant congestion.

The only part of the theorem which we have not yet checked is the number of steps
stage 4 takes. Figure 3-6 gives a more detailed description of the implementation of
stage 4. First, each unassigned poor node is tentatively assigned to a constant number
of topnode neighbors. Each topnode chosen attempts to tentatively assign itself to
one of its unassigned rich neighbors. Each poor node then finds a topnode to which
it is tentatively assigned which successfully was assigned to a rich node.

Since by lemma 3.8 there are few unassigned poor nodes in any small sphere and
we know that most topnodes will have many rich neighbors, the above procedure will
assign every unassigned poor node to a topnode with high probability.

for all poor nodes v unassigned after stage 3 do
    for k ← εn + 1 to n
        if v^k is an unassigned topnode
            assign v to v^k unless v is already assigned to 8d neighbors
for all assigned topnodes u do
    for k ← εn + 1 to n
        if u^k is an unassigned rich node, assign u to u^k and stop
for all poor nodes v unassigned after stage 3 do
    for k ← εn + 1 to n
        if v^k was assigned to v and succeeded in being assigned to a rich node w
            push v^k to w and assign v to v^k


Figure 3-6: Stage 4.

3.4  Routing  Using  Only  Live  Nodes 

If we consider total faults instead of partial faults, algorithm 3.2 fails in several places.
In fact, any path in the embedding which does not consist of only a single edge has at
least one dead node internal to it. In order to handle total faults we will replace the
paths of length 3 in H_n - F which constitute the edges of the embedded hypercube
with paths of length 7 which use only live nodes.

In the remainder of the chapter we will use probabilistic algorithms. To guarantee
the performance of these algorithms, we will need to know that certain assumptions
about the distribution of faults hold true. These assumptions are stated in several
lemmas (for example, lemmas 3.10 and 3.11). Given that these distribution assumptions
hold (which they do except with inverse polynomial probability), the algorithms
work with high probability. The errors arising during particular executions of the
algorithms are thus in some sense independent of the existence of unusual fault patterns.
First we establish that all nodes have a reasonably large number of live neighbors.

Lemma 3.10. There exists a c_13 > 0 such that for any node v, the set N_v of live
neighbors of v has cardinality at least εn with probability 1 - N^{-c_13}.

Proof. A calculation almost identical to the one in the proof of lemma 3.6. ■

With probability close to 1, all pairs of nodes are connected by many paths each
of which contains only live nodes.

Figure 3-7: A live path P^{ij}. The darkened path simulates an edge (u, v).

Lemma 3.11. Suppose every node fails with probability p < 1/2. Then with probability
1 - N^{-c_13} there are Ω(n^2) live paths of length at most 7 between any points u
and v within Hamming distance 3, where we choose c_13 as in lemma 3.10.

Proof. We will prove only the case where the distance is 3; the other cases are
similar. Let P = (u, w_1, w_2, v) be a path of length 3 between u and v. The paths we
will consider are of the type P^{ij} = (u, u^i, u^{ij}, w_1^{ij}, w_2^{ij}, v^{ij}, v^j, v). Let N_u be the
set of dimensions k for which u^k is live and similarly for N_v. Take the larger of the two sets
{(i, j) : i ∈ N_u, j ∈ N_v, i < j} and {(i, j) : i ∈ N_u, j ∈ N_v, i > j}. By lemma 3.10 this set
has cardinality ε^2 n^2/2. The interior 4 nodes of these paths are disjoint for different
pairs i, j (if we discard i, j where either i or j is a dimension used along P) and the
outer 4 nodes are all alive. Thus with high probability Ω(n^2) of these paths use only
live nodes. ■
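
The detour paths P^{ij} from the proof can be generated mechanically. The following is a sketch only: the integer encoding of nodes, dimensions numbered from 1, and the names `flip`, `detour` and `is_live_path` are assumptions made for the illustration.

```python
def flip(v, k):
    """Neighbor of v across dimension k (dimensions numbered from 1)."""
    return v ^ (1 << (k - 1))

def detour(P, i, j):
    """The length-7 path P^{ij} for a length-3 path P = (u, w1, w2, v).

    The path leaves u across dimension i, then dimension j, follows a shifted
    copy of P, and returns to v by undoing dimension i and then dimension j.
    """
    u, w1, w2, v = P
    shift = lambda x: flip(flip(x, i), j)           # x^{ij}
    return [u, flip(u, i), shift(u), shift(w1), shift(w2), shift(v), flip(v, j), v]

def is_live_path(path, faulty):
    """True if consecutive nodes are adjacent and no node on the path is dead."""
    adjacent = all(bin(a ^ b).count("1") == 1 for a, b in zip(path, path[1:]))
    return adjacent and not any(x in faulty for x in path)
```

For P = (0, 1, 3, 7) in H_6 with i = 4 and j = 5, `detour` returns the path 0, 8, 24, 25, 27, 31, 23, 7. Enumerating the pairs (i, j) with u^i and v^j live and filtering with `is_live_path` yields the candidate set from which a random live path can be drawn.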

Once we have established the existence of live paths it is a simple matter to find
them algorithmically. However, if we look for them deterministically it is difficult to
bound the congestion. A random algorithm which uniformly chooses a random live
path for each pair of nodes is easier to analyze. Before we show how well the random
algorithm performs, we prove a simple lemma about balls and boxes.

Lemma 3.12. If each of α balls is placed randomly and uniformly into one of β
boxes, then with probability 1 - (αe/(βγ))^γ there are fewer than γ balls placed in the
first box.

Proof. The probability that there are at least γ balls in the first box is no more
than \binom{α}{γ} β^{-γ} ≤ (αe/(βγ))^γ. ■
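
As a sanity check on lemma 3.12, the exact tail probability can be compared with the stated bound for small parameters. The function names below are assumptions made for this sketch.

```python
from math import comb, e

def tail_at_least(alpha, beta, gamma):
    """Exact probability that the first of beta boxes receives at least gamma
    of alpha balls thrown independently and uniformly at random."""
    q = 1.0 / beta
    return sum(comb(alpha, k) * q ** k * (1 - q) ** (alpha - k)
               for k in range(gamma, alpha + 1))

def lemma_bound(alpha, beta, gamma):
    """The bound (alpha * e / (beta * gamma)) ** gamma from lemma 3.12."""
    return (alpha * e / (beta * gamma)) ** gamma

if __name__ == "__main__":
    print(tail_at_least(100, 50, 10), lemma_bound(100, 50, 10))
```

The bound is useful only when βγ exceeds αe; otherwise it is larger than 1 and holds trivially.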

Theorem 3.13. If we uniformly choose a random live path between each pair of
chosen nodes at the end of algorithm 3.2, then with high probability the resulting
embedding will have congestion O(log N) and induced congestion O(log N/log log N).

Proof. The estimates will follow from lemma 3.12. The balls correspond to the edges
of the paths of the embedding and the boxes are the edges of H_n. The paths which
potentially share a given edge can be separated into classes. We assign a path to a
class depending on which position in the associated live path the edge would occupy
if the live path were actually routed through the edge. We will then show that with
high probability the congestion due to the live paths associated with any one class is
O(n). Since there will be only four classes, the result will follow.

Fix an edge (s, t). Given a path P, put P in class r if (s, t) is the rth edge along
P^{ij} (reading from the closest end) for some pair (i, j). There are four cases we need
consider.

r = 1: Then s = u. Since there are at most n - 1 paths beginning at u, there are
only n - 1 paths of this sort even in the worst case.

r = 2: Then (s, t) = (u^i, u^{ij}). There are n - 1 possible values for u, each an endpoint
of at most n - 1 paths P. Since there are at least (εn)^2/2 choices for (i, j) for each of
these O(n^2) paths, lemma 3.12 applies to show that the probability that more than
O(log N/log log N) of these paths are actually chosen to go through (s, t) is at most
O(N^{-k}).
r = 3: Then (s, t) = (u^{ij}, w_1^{ij}). If w_1 ∈ H_{n-1}, P was embedded for the
edge (u, w_1). Thus only one path of this type exists for each pair (i, j). If w_1 ∉ H_{n-1}
then the path P was embedded for an edge incident to w_1'. Thus only n - 1 paths of
this type exist for each pair (i, j). Therefore the total number of paths P in this class
is no more than n^3 for any edge (s, t). Again the probability that any one of these
paths will actually be chosen to go through (s, t) is no more than 2/(εn)^2. By lemma
3.12 the probability that more than O(log N) of these paths are chosen that way is
at most O(N^{-k}).

r = 4: Then (s, t) = (w_1^{ij}, w_2^{ij}). There are two cases. If both w_1, w_2 ∉ H_{n-1},
P was embedded for (w_1', w_2'). Thus only one path of this type exists for each pair
(i, j). If w_1, w_2 ∈ H_{n-1}, P was embedded for an edge incident to w_1.
Thus only n - 1 paths of this type exist for each pair (i, j). The rest of the analysis
is identical to that of the previous case.

Thus the probability that the congestion is more than O(log N) is at most
$O(N^{-k}) \times N \log N/2 = O(N^{-k+1} \log N)$.

To prove the induced congestion is $O(\log N/\log\log N)$, note that only one path
from class 1 can contribute to the induced congestion. Note also that classes 3 and
4 have only $O(n^2)$ paths in them which can contribute to the induced congestion,
since the original edges could not have been adjacent. Thus the analysis for induced
congestion due to classes 3 and 4 reduces to that of case 2 above. ■


3.5  An  Algorithm  for  Constant  Delay  Embedding 

In  section  3.4  we  resorted  to  probabilistic  means  to  find  fault-free  communication 
paths.  We  will  use  probabilistic  methods  again  in  this  section,  together  with  a  more 
uniform  view  of  the  nodes  of  the  cube.  We  allow  the  max  load  to  rise  to  a  constant, 
and  in  return  we  achieve  constant  congestion. 

To  achieve  a  constant  delay  embedding,  we  need  the  load,  dilation  and  congestion 
to  all  be  constant.  The  embedding  we  will  find  will  have  a  load  and  congestion  which 
depend strongly on the probability of failure: clearly the more nodes that fail, the
more nodes that have to be simulated by any one processor. However, the dilation
will always remain five, and each processor will be simulated by one of its neighbors,
provided that $p < 1 - \sqrt[4]{.5}$ (about .16).

In order to simplify the analysis, each node (live or dead) finds a neighbor to
simulate it. We first assign nodes to live neighbors so that no node simulates more
than a constant number of its neighbors. Then each pair of nodes simulating neighbors
finds a live path between them of length five so that no more than a constant number
of these paths congest any edge. We will use two similar algorithms to accomplish
these two tasks.

3.5.1  Assigning  Nodes  to  Live  Neighbors 

Let $A_p$ and $s_p$ be constants (to be determined later) which depend only upon the
probability p of failure. Call a node unsaturated if it is live and if it has been assigned
to simulate fewer than $A_p$ of its neighbors. Otherwise, it is saturated.

The assignment algorithm proceeds in rounds. During a round, a previously unsaturated
node might be picked by enough unassigned nodes so as to exceed its capacity
$A_p$. In such a case, we require the node to accept enough of the simulation requests
to saturate it. Algorithm 3.3 performs the first phase.

for t = 1 to $s_p n$
    for each unassigned node w
        w picks one of its neighbors uniformly
    each unsaturated node v agrees to simulate as many nodes as it can
        without exceeding its capacity
    all excess nodes remain unassigned


Figure  3-8:  Algorithm  3.3. 
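
The rounds of algorithm 3.3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation discussed in this chapter: nodes are integers whose bits are hypercube coordinates, faults are drawn independently with probability p, and the names `assign_to_live_neighbors`, `capacity` (standing in for the load bound) and `rounds` (standing in for the number of rounds) are hypothetical.

```python
import random

def neighbors(v, n):
    """Neighbors of node v in the n-dimensional hypercube (flip one bit)."""
    return [v ^ (1 << i) for i in range(n)]

def assign_to_live_neighbors(n, p, capacity, rounds, seed=0):
    """Parallel-round assignment: every node (live or dead) seeks a live neighbor."""
    rng = random.Random(seed)
    N = 1 << n
    live = [rng.random() >= p for _ in range(N)]   # independent node faults
    load = [0] * N            # how many nodes each node currently simulates
    simulator = [None] * N    # simulator[w] = neighbor that simulates w
    for _ in range(rounds):
        unassigned = [w for w in range(N) if simulator[w] is None]
        if not unassigned:
            break
        # Every unassigned node picks one of its neighbors uniformly.
        requests = {}
        for w in unassigned:
            v = rng.choice(neighbors(w, n))
            requests.setdefault(v, []).append(w)
        # Each live node accepts requests only up to its remaining capacity;
        # the excess requesters stay unassigned and retry next round.
        for v, askers in requests.items():
            if not live[v]:
                continue
            room = capacity - load[v]
            for w in askers[:room]:
                simulator[w] = v
                load[v] += 1
    return live, simulator, load
```

By construction no node ever simulates more than `capacity` others; whether every node ends up assigned depends on the random choices, which is what the lemmas of this section address.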

Since the algorithm never assigns a saturated node to simulate another node, no
node simulates more than $A_p$ nodes. Thus, a constant load embedding results.

To  facilitate  our  proofs,  we  will  first  formulate  a  sequential  algorithm  similar 
to  algorithm  3.3.  We  will  prove  that  this  new  algorithm  assigns  to  each  node  a 
neighboring  node  to  simulate  it.  We  will  then  show  that,  except  for  a  small  proportion 
of  executions,  the  algorithms  behave  the  same. 

In each round of algorithm 3.4, unassigned nodes act sequentially. Each node
chooses a neighbor to simulate it only after all lower ordered nodes have chosen. We
would like to ensure that all nodes have a large number of choices that will result in a
successful assignment. Let $\alpha_p$ depend only upon the probability p. If some node w has
fewer than $\alpha_p n$ unsaturated neighbors to choose from during its turn, we designate
an arbitrary set of saturated neighbors as dedicated to w during its turn. If w chooses
a dedicated node during that particular turn, the dedicated node agrees to simulate
w even though it is saturated. We dedicate enough nodes so that w has at least $\alpha_p n$
neighbors which, if chosen, will agree to simulate it.

for i = 1 to $s_p n$
    for unassigned nodes w in arbitrary order
        if w has fewer than $\alpha_p n$ unsaturated neighbors
            arbitrarily dedicate enough (saturated) neighbors
        w picks one of its neighbors uniformly
        if the chosen node is unsaturated or dedicated
            w is assigned to that node
        else w remains unassigned


Figure 3-9: Algorithm 3.4.

We  will  show  below  that  with  high  probability  no  nodes  are  ever  dedicated  during 
algorithm  3.4.  In  that  case,  the  result  is  the  same  whether  unassigned  nodes  choose 
sequentially  or  in  parallel.  Thus  we  will  show  that  algorithms  3.3  and  3.4  produce 
the  same  output. 
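
A minimal Python sketch of the sequential variant, with dedication counted explicitly, may help make the proof device concrete. The names `sequential_assignment`, `capacity` and `threshold` (the minimum number of willing neighbors guaranteed at each turn) are hypothetical, and the fault pattern `live` is supplied by the caller. As in the text, any saturated neighbor (full or dead) may be dedicated, so a run is meaningful as an embedding only when the returned dedication count is zero.

```python
import random

def neighbors(v, n):
    """Neighbors of node v in the n-dimensional hypercube (flip one bit)."""
    return [v ^ (1 << i) for i in range(n)]

def sequential_assignment(n, live, capacity, threshold, rounds, seed=0):
    """Nodes choose one at a time; short-handed nodes get dedicated helpers."""
    rng = random.Random(seed)
    N = 1 << n
    load = [0] * N
    simulator = [None] * N
    dedications = 0
    for _ in range(rounds):
        for w in range(N):                 # arbitrary (here: increasing) order
            if simulator[w] is not None:
                continue
            nbrs = neighbors(w, n)
            # Unsaturated neighbors: live and below capacity.
            willing = {v for v in nbrs if live[v] and load[v] < capacity}
            if len(willing) < threshold:
                # Dedicate saturated neighbors for this turn only.
                spare = [v for v in nbrs if v not in willing]
                extra = spare[:threshold - len(willing)]
                dedications += len(extra)
                willing.update(extra)
            v = rng.choice(nbrs)           # uniform choice among all neighbors
            if v in willing:
                simulator[w] = v
                load[v] += 1
    return simulator, load, dedications
```

When `dedications` comes back as zero, no saturated node was ever asked to exceed its capacity, which is the situation in which the sequential and parallel algorithms coincide.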

The  following  lemma  proves  that  algorithm  3.4  terminates  quickly. 

Lemma 3.14. With high probability all nodes have been assigned after $s_p n$ steps of
algorithm 3.4, for sufficiently large $s_p$.

Proof. Because each node always has at least $\alpha_p n$ neighbors which will simulate it if
chosen, the probability that a given node is assigned during some step is at least $\alpha_p$,
regardless of what has occurred in previous steps. Thus the probability that a node
remains unassigned after $s_p n$ steps is no more than $(1 - \alpha_p)^{s_p n}$. This quantity is less
than $N^{-k}$ as long as $s_p > k/\alpha_p$. ■
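
The arithmetic behind this proof can be checked numerically. The sketch below assumes the bound reads $(1 - \alpha_p)^{s_p n}$ and the target is $N^{-k}$ with $N = 2^n$, which is how the proof above appears to read; the function names are hypothetical. The inequality holds because $(1 - \alpha)^{sn} \le e^{-\alpha s n} < e^{-kn} < 2^{-kn}$ whenever $s > k/\alpha$.

```python
def unassigned_bound(alpha, s, n):
    """Upper bound (1 - alpha)^(s*n) on a node staying unassigned for s*n steps."""
    return (1.0 - alpha) ** (s * n)

def inverse_polynomial(k, n):
    """N^(-k) for an N-node hypercube with N = 2^n."""
    return 2.0 ** (-k * n)

# Example: alpha = 0.1 and k = 2, so s = k / alpha = 20 steps per dimension.
for n in (4, 8, 16):
    print(n, unassigned_bound(0.1, 20, n) < inverse_polynomial(2, n))
```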

Lemma 3.15. For $p < 1 - \sqrt[4]{.5}$, there exists an $\epsilon_p$ and a constant $c_{15} > 0$ such that
with probability at least $1 - N^{-c_{15}}$ each node has at least $\epsilon_p n$ live neighbors.

Proof. The probability that a node has fewer than $\epsilon n$ live neighbors equals

$$\sum_{i=0}^{\epsilon n} \binom{n}{i} (1-p)^i p^{n-i}.$$

Since the ratio of consecutive terms is always greater than $(1 - p)/p$, this sum is
bounded by a constant times its last term. That term is

$$\binom{n}{\epsilon n} \cdot (1-p)^{\epsilon n} p^{(1-\epsilon) n}.$$

The second term in the product can be made less than [...] for some $c_{15}$ by taking
$\epsilon$ small enough. The first term in the product can be made less than [...] by taking
$\epsilon$ small enough as well. The probability that some node has too few neighbors is
bounded by the sum of the probabilities for the individual nodes. This multiplies
the above bound by N. Thus for any $\epsilon$ below both of these thresholds, the theorem
applies. ■
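
To get a feel for the sizes involved, the binomial tail in this proof can be evaluated directly. The sketch below is an illustration only: it assumes each of a node's n neighbors is live independently with probability 1 - p, and the function name `prob_fewer_than` and the sample parameters are hypothetical.

```python
from math import ceil, comb

def prob_fewer_than(n, p, eps):
    """P[a node has fewer than eps*n live neighbors] when each of its
    n neighbors is live independently with probability 1 - p."""
    limit = ceil(eps * n)                 # counts i = 0, 1, ..., limit - 1
    return sum(comb(n, i) * (1 - p) ** i * p ** (n - i) for i in range(limit))

# Example: n = 20 (about a million nodes), p = 0.15, eps = 0.25.
single = prob_fewer_than(20, 0.15, 0.25)
union = (1 << 20) * single                # union bound over all N = 2^n nodes
print(single, union)
```

Even after multiplying by N for the union bound, the failure probability in this example stays far below one.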

The  following  two  lemmas  show  that  with  high  probability  algorithm  3.4  never 
dedicates  saturated  nodes.  Thus  with  high  probability  algorithms  3.3  and  3.4  behave 
identically.  This  proves  that  algorithm  3.3  assigns  all  nodes  with  high  probability. 
Similar  reasoning  proves  the  Dance  Hall  Theorem  described  in  the  introduction. 

Lemma 3.16. Given a failure rate p, assume that every node has at least $\epsilon_p n$ live
neighbors. Then with high probability a given node v never has fewer than $\alpha_p n$
unsaturated neighbors available during algorithm 3.4, for $\alpha_p = \epsilon_p/2$.

Proof. For v to have fewer than $\alpha_p n$ unsaturated neighbors at some point during
algorithm 3.4, at least $(\epsilon_p - \alpha_p)n = \alpha_p n$ of v's neighbors must have become saturated
during the course of the algorithm.

Each node always has at least $\alpha_p n$ neighbors (including dedicated nodes) to which
it might be assigned during any step. Further, if it is assigned, it is equally likely to
be assigned to any one of those neighbors. Thus no node has a probability greater
than $1/(\alpha_p n)$ that it will be assigned to any given neighbor, no matter what other
assignments have been made previously.

To saturate $\alpha_p n$ of v's neighbors, there must be at least $A_p \alpha_p n$ nodes at Hamming
distance two from v each of which is assigned to a neighbor of v. There are no more
than $n^2$ nodes which might be assigned to some node in v's neighborhood. Each
one of these nodes has at most two neighbors of v to which it might be assigned.
Although the probabilities of such selections are dependent, the probability a given
node is assigned to a neighbor of v is at most $2/(\alpha_p n)$, no matter what choices the other
nodes made. The probability that at least $\alpha_p n$ of v's neighbors become saturated is
thus no more than

$$\binom{n^2}{A_p \alpha_p n} \left(\frac{2}{\alpha_p n}\right)^{A_p \alpha_p n} \le \left(\frac{2e}{A_p \alpha_p^2}\right)^{A_p \alpha_p n}.$$

For $A_p$ large enough, this quantity is an inverse polynomial in N. ■

Lemma  3.16  implies  that  with  high  probability  algorithms  3.3  and  3.4  behave 
identically.  We  know  that  algorithm  3.4  successfully  assigns  each  node  to  a  neighbor 
with high probability and that algorithm 3.3 never assigns more than $A_p$ nodes to
any  node.  We  conclude  that  algorithm  3.3  achieves  a  constant  load  embedding  with 
high  probability. 

3.5.2  Assigning  Edges  to  Paths 

Once we've assigned simulating nodes, we need to find paths to simulate the edges
in the hypercube. Say that $v^b$ simulates v and $v^{kb'}$ simulates $v^k$. Then to simulate
the edge $(v, v^k)$, the nodes $v^b$ and $v^{kb'}$ choose a path between them of the form
$P(v, v^k, b, b', r) = (v^b, v^{br}, v^r, v^{kr}, v^{kb'r}, v^{kb'})$. To avoid ambiguity, we will refer to the
choice of r as if it were made by v and $v^k$ even though $v^b$ and $v^{kb'}$ actually choose.

For two adjacent nodes v and $v^k$, let $S(v, v^k, b, b')$ be the set of dimensions $r \ne k, b$
for which $P(v, v^k, b, b', r)$ is a live path. Because $p < 1 - \sqrt[4]{.5}$ there is a chance
$(1 - p)^4 = s > \frac{1}{2}$ that any given path $P(v, v^k, b, b', r)$ is live. Note that the paths
$P(v, v^k, b, b', r)$ ($r \ne k, b$) are node-disjoint for a fixed choice of v, $v^k$, b and b'. Thus the
probability that any one of them is live is independent of the other paths.
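
The failure threshold quoted as "about .16" can be reproduced with a two-line calculation. The sketch assumes, as the surrounding text suggests, that a path of length five between two live simulating nodes has four internal nodes, each failing independently with probability p, and that the requirement is a better than even chance that the path is live; the function names are hypothetical.

```python
def live_path_probability(p, internal_nodes=4):
    """Chance that all internal nodes of a path are live."""
    return (1.0 - p) ** internal_nodes

def failure_threshold(internal_nodes=4):
    """Largest p for which a path is live with probability at least 1/2."""
    return 1.0 - 0.5 ** (1.0 / internal_nodes)

print(failure_threshold())          # roughly 0.159
print(live_path_probability(0.15))  # just above one half
```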

Lemma 3.17. With high probability, for all quadruples $(v, v^k, b, b')$, $|S(v, v^k, b, b')| >
\eta_p n$ for some constant $\eta_p$.

Figure 3-10: A choice of live path.


Proof. Same as lemma 3.15, except that there are $N \log^3 N$ different quadruples. ■


With  high  probability,  we  know  that  all  pairs  of  neighbors  have  many  paths  from 
which  to  choose.  What  remains  is  for  them  to  decide  in  a  systematic  but  local  fashion 
how  to  choose  from  among  these  paths  without  congesting  any  edge  too  much.  In 
the  rest  of  this  section,  we  explore  a  way  to  choose  paths  in  this  manner. 

Take a node v simulated by its neighbor $v^b$ and consider the set of edges
$E_{v,b} = \{(v^{br}, v^r)\}$, one for each dimension r.
There are $2n^2$ nodes w (all of the form $w = v^{ij}$ or $w = v^{bij}$) which (like
v) might potentially use one of the edges in the set as a second edge along a path.
Any node which actually does must be simulated by its neighbor across dimension b.
The next lemma bounds the number of such nodes.


Lemma 3.18. For sufficiently large $\delta_p$ and with high probability, of the $2n^2$ nodes
at distance 0 or 2 from either v or $v^b$, no more than $\delta_p n$ of them are simulated by
neighbors across dimension b.


Proof. As noted before, each node has a probability no more than $1/(\alpha_p n)$ of borrowing
across any given dimension, regardless of the choices made by other nodes. The
probability that many nodes choose across the same dimension is no more than

$$\binom{2n^2}{\delta_p n} \left(\frac{1}{\alpha_p n}\right)^{\delta_p n}.$$

Of course, the actual probabilities depend on the particular $\delta_p n$-size subset we consider
and on the relative order in which the nodes of the subset successfully found
neighbors to simulate them. Then any node's probabilities are conditioned upon
other nodes' previous choices. No matter how these choices are made, however, the
stated probabilities are upper bounds on the actual probabilities since when each
node chooses it always has at least $\alpha_p n$ choices.

For sufficiently large $\delta_p$ this is smaller than an inverse polynomial in N. ■

Each of the at most $\delta_p n$ nodes (except for v and $v^b$) can use at most two edges
in the set $E_{v,b}$ as a second edge along some path. To use an edge as a second edge,
such a node would have to be a neighbor of one of the nodes incident to the edge. If
w is of the form $w = v^{ij}$, then w is adjacent to $v^i$ and $v^j$ and no other node incident
to an edge in $E_{v,b}$. Similar reasoning applies to nodes w which satisfy $w = v^{bij}$.
Trivially, each of v and $v^b$ can use no more than n edges of $E_{v,b}$ as a second edge
along some path. If we sum over all edges in $E_{v,b}$ the number of nodes which can use
each edge as a second edge counting according to multiplicity, the total will be no
more than $(2\delta_p + 2)n$. Therefore no more than $\eta_p n/4$ of these edges will have more
than $\gamma_p = 4(2\delta_p + 2)/\eta_p$ of those $\delta_p n$ nodes potentially using them as second edges.
Let $S'(v, b) = \{r \mid$ more than $\gamma_p$ nodes can send a path through the edge $(v^{br}, v^r)\}$.
Then $|S'(v, b)| < \eta_p n/4$.

Let $T(v, v^k, b, b') = S(v, v^k, b, b') - S'(v, b) - S'(v^k, b')$. Then for each adjacent
pair of nodes v and $v^k$, $|T(v, v^k, b, b')| > \eta_p n/2$. The sets $T(v, v^k, b, b')$ will be crucial
for our reasoning. The probability that a pair successfully choose a path between
them is lower bounded by the probability that they successfully choose the path from
$T(v, v^k, b, b')$.

Note that among the edges in all the paths represented by the sets $T(v, v^k, b, b')$,
there are now only a logarithmic number of quadruples which might
potentially congest any given edge. We've already limited the number of paths for
which the edge is the second edge along the path. If the edge is the first edge along
the path, then one of the edge's endpoints is the simulating node. Each endpoint
simulates only a constant number of nodes, and each simulated node contributes
exactly n paths. If the edge is the third edge along the path, then the path is
simulating an edge at Hamming distance one from the edge considered. There are
exactly n edges of this type. The cases in which the edge is the fourth or fifth edge
along the path are identical to the first two cases. Thus each edge can be potentially
congested by no more than $\mu_p n = (4A_p + 2\gamma_p + 1)n$ paths.

We can now describe algorithm 3.5, which assigns paths to simulate edges. During
algorithm 3.5, each edge will decide whether or not to accept some path routed
through it. Because the other edges in the path simultaneously decide whether or
not to accept the path, it is possible that some might accept it while others reject it.
If this happens, we assume that an accepting edge counts the path as contributing
to its load anyway. Call an edge saturated if it has accepted exactly $B_p$ paths routed
through it. Otherwise, call it unsaturated. Order the pairs $(v, v^k)$ lexicographically.
As before, in any round an edge accepts an arbitrary set of pairs which try to route
through it until it reaches its capacity.

for t = 1 to $s_p n$
    for each unassigned adjacent pair of nodes $(v, v^k)$
        $(v, v^k)$ pick a path between them uniformly
    each unsaturated edge agrees to as many paths routed through as it can
        without exceeding its capacity, deciding conflicts arbitrarily
    all excess pairs remain unassigned


Figure  3-11:  Algorithm  3.5. 

Parallelling what we did before, we will present algorithm 3.6, a sequential version
of algorithm 3.5. We will show that this modified algorithm terminates having
assigned paths between every pair of nodes simulating neighbors, with high probability.
Maintaining the parallel with what we proved earlier in this section, we will
then show that the two algorithms perform indistinguishably, with high probability.
At any time when the pair $(v, v^k)$ attempt to choose a path between them during
algorithm 3.6, let $U(v, v^k, b, b')$ be the subset of $T(v, v^k, b, b')$ consisting of dimensions
r for which all of the edges along $P(v, v^k, b, b', r)$ are unsaturated. Define the dedication
of a path containing a saturated edge in a fashion similar to the dedication
of saturated neighbors before. We dedicate paths to the pair $(v, v^k)$ whenever $\beta_p n$
choices for a simulating path do not exist.


for i = 1 to $s_p n$
    for all unassigned pairs $(v, v^k)$ in arbitrary order
        if $|U(v, v^k, b, b')| < \beta_p n$
            dedicate enough $r \in T(v, v^k, b, b')$
        $(v, v^k)$ pick a path between them uniformly
        if the chosen path is unsaturated or dedicated
            $(v, v^k)$ is assigned to the path
        else $(v, v^k)$ remains unassigned


Figure  3-12:  Algorithm  3.6. 

Lemma 3.19. For a suitably large choice of the constant $s_p$, with high probability
all pairs of nodes searching for an assignment to a path have been assigned one after
$s_p n$ steps of algorithm 3.6.

Proof. Each pair is successfully assigned with probability at least $\beta_p$ during any step.
The rest of the proof is identical to that of lemma 3.14. ■

We now show that with high probability algorithm 3.6 never adds dedicated paths
with saturated edges to any $U(v, v^k, b, b')$. Thus with high probability algorithms 3.5
and 3.6 behave identically. This proves that algorithm 3.5 assigns all necessary paths
with high probability.

Lemma 3.20. With high probability no set $U(v, v^k, b, b')$ ever has cardinality less
than $\beta_p n$ at the beginning of some step of algorithm 3.6, given $\beta_p = \eta_p/4$.

Proof. There are at most $\mu_p n$ pairs which have a non-zero probability of congesting
a given edge on some path represented by an $r \in T(v, v^k, b, b')$. Thus at most $5\mu_p n^2$
pairs have non-zero probability of congesting any of those edges, counting according to
multiplicity. For a path to leave $U(v, v^k, b, b')$ one of its edges must become saturated.
For $(\eta_p/2 - \beta_p)n = \beta_p n$ paths to become unavailable, $B_p \beta_p n$ pairs must choose a path
crossing an edge on some path represented by an $r \in T(v, v^k, b, b')$.

The probability that a pair chooses any particular path is at most $1/(\beta_p n)$, no matter
what other choices are made. Thus if there are $q_{w,w^j}$ paths that a particular pair
$(w, w^j)$ might choose which contain an edge on some path in $T(v, v^k, b, b')$, then the
probability that $(w, w^j)$ chooses such a path is at most $q_{w,w^j}/(\beta_p n)$, and $q_{w,w^j} <$ [...].

By a moment generating function argument similar to those in lemmas 2.1 and 2.7
and in theorem 2.13, the probability that more than $\beta_p n$ paths become unavailable
is therefore no more than $O(N^{-k})$ for arbitrary k. ■

With high probability O(n) steps are sufficient to select all paths. Since we have
guaranteed that the paths have constant congestion, this proves the following theorem.

Theorem 3.21. For each $p < 1 - \sqrt[4]{.5}$ (about .16) there exists an $\epsilon_p$ and an $\eta_p$
such that with probability 1 - [...] at least $\epsilon_p n$ neighbors of every node are live
and $|S(v, v^k, b, b')| > \eta_p n$ for all quadruples $(v, v^k, b, b')$. Given these facts hold, there
is an O(log N) step algorithm which with high probability finds an embedded fully
functioning N-node cube in $H_n - F$ with constant load, dilation and congestion. The
paths which simulate the edges of the cube only use live nodes.


3.6  Implementing  the  Constant  Delay  Embedding 

As  given  so  far,  the  algorithms  of  the  previous  section  are  far  from  implementable. 
Each  node  needs  to  know  information  about  which  nodes  have  decided  to  simulate 
which  other  nodes,  which  paths  it  may  route  through,  whether  or  not  certain  tentative 
assignments  have  been  finalized,  and  so  forth.  In  this  section  we  will  show  how  such 
information  might  be  exchanged  in  polylogarithmic  time  per  step.  This  implies  that 
the  embedding  of  the  previous  section  is  obtainable  in  polylogarithmic  time. 

Focus on any particular node v. Because v might be faulty, one of its neighbors
must choose a simulating node for it. Arbitrarily, we will use the lexicographically
smallest labelled live neighbor to simulate v during the course of algorithm 3.3. First,
the neighbors must agree on which one of them is the lowest. During any step of
algorithm 3.3, that neighbor of v must inform all the other neighbors which one
of them v selected during that step. Both of these operations are trivial once we
understand how a node's neighbors can communicate even with faults.

Each node $v^i$ broadcasts to v's other neighbors by first broadcasting to all of its
neighbors. Then each node $v^{ij}$ passes the information to its unique other neighbor
which is also a neighbor of v, the node $v^j$. A picture of this type of broadcasting
appears in figure 3-13.


Figure  3-13:  Broadcasting  to  other  neighbors. 
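
One round of this relay scheme can be sketched as follows, assuming nodes are integers and $v^i$ denotes v with bit i flipped. The function name `relay_broadcast` and the list `live` of fault indicators are hypothetical; the node v itself is never used as a relay, so it may be dead.

```python
def relay_broadcast(v, i, n, live):
    """Neighbors v^j of v that v^i reaches in one broadcast round, where the
    message travels v^i -> v^{ij} -> v^j and never passes through v."""
    src = v ^ (1 << i)
    reached = set()
    if not live[src]:
        return reached                 # a dead node sends nothing
    for j in range(n):
        if j == i:
            continue
        relay = src ^ (1 << j)         # the intermediary v^{ij}
        target = v ^ (1 << j)          # the other neighbor v^j of v
        if live[relay] and live[target]:
            reached.add(target)
    return reached

# Example on a 4-cube: v = 0 is dead, yet its neighbor 1 still reaches 2, 4 and 8.
live = [True] * 16
live[0] = False
print(sorted(relay_broadcast(0, 0, 4, live)))
```

A dead intermediary blocks only the one target behind it, which is why repeating the broadcast lets the information spread through other live neighbors.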

We only care if the message gets through to the other neighbors of v which are live.
The broadcast we have described sends the messages through a set of intermediary
nodes, several of which are likely to be faulty. Thus if each node broadcasts just once,
we might expect that several nodes will not receive the information they need. We
remedy this problem by allowing each node to broadcast its information and then
repeating the broadcast twice more. With probability $1 - N^{-c_{18}}$, every neighbor of v
is informed of the activity of all of v's other neighbors. We prove this scheme works
by showing that with probability $1 - N^{-c_{18}}$, for every node v of the hypercube, every
live pair $v^i$ and $v^j$ of v's neighbors are connected by a live path consisting of two or
three broadcasts.

Lemma 3.22. With probability $1 - N^{-c_{18}}$, between every pair of live neighbors
$v^i$, $v^j$ of every node v there is a live path of the form $(v^i, v^{ik}, v^k, v^{jk}, v^j)$ or of the form
$(v^i, v^{ik}, v^k, v^{kl}, v^l, v^{jl}, v^j)$.

Proof. Consider a node v and a neighbor $v^i$. The neighbor $v^i$ successfully broadcasts
to another neighbor $v^k$ exactly when both $v^k$ and the intermediary node $v^{ik}$ are
live. Since the probability of failure is no more than $1 - \sqrt[4]{.5}$, the probability that
one or both of these nodes are dead is no more than $1 - \sqrt{.5}$. Further, none of these
pairs $\{v^k, v^{ik}\}$ share a common node. Thus there is a $\rho > 0$ and a $c_{17} > 0$ such that
with probability $1 - N^{-c_{17}}$, $\rho n$ of the pairs will be live, for all neighbors $v^i$ of v.

Next take two disjoint sets $S_1$ and $S_2$ each containing $\rho n$ neighbors of a given
node v and consider the set $T(S_1, S_2) = \{v^{kl} \mid v^k \in S_1, v^l \in S_2$ and $v^{kl}$ is live$\}$. Since
there are $\rho^2 n^2$ pairs of nodes $v^k$, $v^l$ which satisfy the first two requirements and each
pair has a constant probability that it satisfies the last requirement (independent of
other nodes), with high probability the set $T(S_1, S_2)$ is nonempty. There are no more
than [...] ways to choose the sets $S_1$ and $S_2$ for any given node v. Thus with high
probability the set $T(S_1, S_2)$ is nonempty for all choices of $S_1$ and $S_2$. Since there are
only N choices for v, with high probability for each node v and each choice of $S_1$ and
$S_2$, the set $T(S_1, S_2)$ is nonempty.

With probability $1 - N^{-c_{18}}$ (for any $0 < c_{18} < c_{17}$), the conclusions of the first
two paragraphs hold. For a given node v, let $V_1 = \{v^k \mid v^{ik}$ and $v^k$ are both live$\}$ and
$V_2 = \{v^k \mid v^{jk}$ and $v^k$ are both live$\}$. Then if $|V_1 \cap V_2| \ne 0$, a path of length four
connects $v^i$ and $v^j$. If $|V_1 \cap V_2| = 0$ then $v^i$ and $v^j$ are connected by a path of length
six. ■

Before algorithm 3.5 can route the simulating paths, each node must know which
nodes simulate the neighbors of the nodes it simulates. At the end of algorithm 3.3
each live node knows which nodes simulate each of its neighbors. At least $\rho n$ neighbors
of every node are live. For every pair of neighbors v and $v^k$, we only need some live
neighbor of v to communicate with some live neighbor of $v^k$. Each neighbor $v^i$ of v
attempts to route along the path [...] to each neighbor $v^{kj}$ of $v^k$. We are
only interested in the $(\rho n)^2$ paths which begin and end at live nodes. Each of these
paths intersects at most one other and with high probability one of the $(\rho n)^2/2$
node-disjoint paths will be nonfaulty. Since there are only a polynomial number of pairs of
neighbors, each with only a polynomial number of possible sets of live neighbors, this
communication will be possible with high probability. A series of three broadcasts by
all nodes will accomplish the task in O(log^[...] N) steps.

Finally, we will describe how to implement a slight variant of algorithm 3.5. Assume
that v has an even number of 1's in its bit-vector representation. To avoid
confusion, to find a simulating path for the edge $(v, v^k)$, only the node $v^b$ simulating
v will actually choose a path. Say that during a step of the algorithm, instead
of choosing a random dimension in $U(v, v^k, b, b')$, $v^b$ chooses a random dimension
from {1, 2, ..., n}. Then we know (1) the probability that $v^b$ chooses any particular
$r \in U(v, v^k, b, b')$ does not increase, (2) all sets $U(v, v^k, b, b')$ have cardinality
at least $\beta_p \log N$ with at least the same probability as in algorithm 3.5 and (3) each node
has probability at least $\beta_p$ of choosing an $r \in U(v, v^k, b, b')$.

During any step of this modified algorithm, all of the nodes that have chosen
an $r \in U(v, v^k, b, b')$ succeed with at least the probability stated in the analysis of
algorithm 3.5. All other nodes may or may not find an unsaturated path and may
or may not encounter too much congestion. With high probability, if we run the
modified algorithm $2/\beta_p$ times as long as algorithm 3.5, each node will choose an
$r \in U(v, v^k, b, b')$ at least as many times as it did in algorithm 3.5. Thus, even if
nodes never find simulating paths except when they choose an $r \in U(v, v^k, b, b')$, all
nodes will find the necessary paths at least as successfully as before.

Each  node  that  chooses  a  path  attempts  to  route  a  message  describing  the  path 
along  the  path.  Any  node  along  the  path  can  send  messages  back  if  it  detects  too 
much  congestion  along  one  of  its  edges.  If  the  message  comes  back,  the  even  node 
knows  that  it  was  unsuccessful.  Otherwise,  both  the  even  node  and  the  odd  node 
which is the message's destination know which path to use in the future.


3.7 Extensions and Remarks

As mentioned in the introduction, edge faults are easily handled once node faults are
understood. Say each edge fails with probability $p_e$, each node fails with probability
$p_n$ and the failure of any component is independent of the failure of other components.
Then all results still follow with little change. Specifically, as long as
$p_n + p_e - p_n p_e < 1 - \sqrt[5]{.5}$ (about .13), the algorithms of sections 3.5 and 3.6 work with high probability.
The only addition to our reasoning is that when one node tries to communicate with
a neighbor node, it is unsuccessful not only if the neighbor is faulty but also if the
link between them has failed.
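
The combined condition is easy to evaluate numerically. The sketch below assumes that a hop fails when either the link or the node at its far end fails, so the per-hop failure probability is $p_n + p_e - p_n p_e$, and that the quoted threshold of about .13 corresponds to requiring five consecutive hops to succeed with probability above one half. The function names are hypothetical.

```python
def combined_failure(p_node, p_edge):
    """Probability that a hop fails: the link or the node at its far end is faulty."""
    return p_node + p_edge - p_node * p_edge

def hop_threshold(hops=5):
    """Largest per-hop failure rate for which `hops` consecutive hops
    all succeed with probability at least 1/2."""
    return 1.0 - 0.5 ** (1.0 / hops)

q = combined_failure(0.10, 0.03)
print(q, hop_threshold(), q < hop_threshold())
```

For example, a node failure rate of .10 together with an edge failure rate of .03 gives a combined rate of .127, just inside the threshold.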

This work extends to the case in which p is small; that is, if $p < N^a$ for $0 > a > -1$.
In this case, faults are so few and far between that the results of the second section can be
strengthened. The deterministic algorithm 3.1 achieves a constant delay embedding
with high probability. This result follows directly from the following fact.

Lemma 3.23. If faults occur with probability p for small p then with high probability
no sphere of radius 14 contains more than a constant number of faults.

Proof. Say $p < N^a$. Then the probability that m nodes out of any given [...] nodes
are faulty is no more than [...].

There are at most N such spheres to consider. If $m > -1/a$, the total probability
that some sphere contains m faults is an inverse polynomial whose exponent can be
diminished by increasing m. ■
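
The role of the condition $m > -1/a$ can be illustrated numerically. The sketch below is an assumption-laden reading of the lemma, not a reconstruction of its lost formula: it takes the volume V of a radius-14 sphere in the n-cube, bounds the chance that some m of those V nodes are all faulty by $\binom{V}{m} p^m$ with $p = N^a$, and multiplies by the N possible centers. The function names are hypothetical.

```python
from math import comb, log2

def sphere_size(n, r):
    """Number of nodes within Hamming distance r of a fixed node of the n-cube."""
    return sum(comb(n, i) for i in range(r + 1))

def log2_union_bound(n, a, m, r=14):
    """log2 of N * C(V, m) * (N^a)^m, where N = 2^n and V = sphere_size(n, r)."""
    V = sphere_size(n, r)
    return n + log2(comb(V, m)) + a * m * n

# With a = -1/2 the bound is useless at m = 2 but tiny at m = 4.
print(log2_union_bound(1000, -0.5, 2), log2_union_bound(1000, -0.5, 4))
```

Since V is only polynomial in n = log N, the factor $N^{1 + am}$ dominates once m exceeds $-1/a$, and the bound then shrinks further as m grows.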

Each  simulating  node  only  needs  to  distribute  its  connections  among  the  dilation 
7  fault-free  paths  discovered  in  section  3.4. 

Last,  a  word  about  the  practicality  of  the  results  of  this  chapter.  We  have  made 
little  attempt  to  optimize  constants  since  we  need  large  constants  to  obtain  the  full 
breadth  of  our  results.  However,  in  practice  the  full  strength  of  these  theorems  will 
probably be unnecessary. We cannot expect half the processors in a network to fail as
a normal occurrence. We are optimistic that when the number of faults is moderate,
our techniques will work quite well either on their own or as a basis for some heuristic
approach.


Chapter 4


Embedding  Trees  Dynamically 


4.1  Introduction 

Achieving  high  performance  on  a  parallel  computer  requires  the  satisfaction  of  two 
potentially  conflicting  requirements.  First,  the  computational  load  posed  by  the 
program  should  be  evenly  shared  among  all  processors  (load  balancing).  Second, 
processes  communicating  frequently  should  be  placed  on  processors  that  are  close 
(communication  locality). 

This problem has been studied abstractly as the problem of embedding a process
graph G in a processor graph H ([BCHLR], [BCLR], [BI], [C], [GHR], [HJ],
[KLMRR]). The vertices of G are processes comprising the parallel program, with
edges representing communication between processes. The vertices of H are processors,
and the edges represent communication channels. For many computations, it is
possible to predict G before execution. In such cases it is useful to map the vertices
of G into those of H so as to minimize load, dilation and congestion.
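
As a small illustration of two of these measures, the sketch below computes the load and dilation of a given placement on a hypercube. It assumes hypercube nodes are labelled by integers, so that distance is Hamming distance; the names `load_and_dilation` and `placement` are hypothetical and not taken from the text. Congestion is left out because it depends on which routing path is chosen for each edge.

```python
from collections import Counter

def hamming_distance(a: int, b: int) -> int:
    """Distance between two hypercube nodes given as integer labels."""
    return bin(a ^ b).count("1")

def load_and_dilation(edges, placement):
    """edges: iterable of (u, v) process pairs.
    placement: dict mapping each process to a hypercube node label."""
    # Load: the largest number of processes sharing one processor.
    load = max(Counter(placement.values()).values())
    # Dilation: the longest distance between the images of two adjacent processes.
    dilation = max(
        (hamming_distance(placement[u], placement[v]) for u, v in edges),
        default=0,
    )
    return load, dilation

# A 3-vertex path a-b-c placed on a 2-dimensional hypercube (nodes 0..3).
edges = [("a", "b"), ("b", "c")]
placement = {"a": 0b00, "b": 0b01, "c": 0b11}
print(load_and_dilation(edges, placement))
```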

This chapter focuses on embedding arbitrary binary trees into the butterfly and
hypercube networks. Trees arise naturally in many computations: divide-and-conquer
algorithms, branch-and-bound search ([KZ]), functional expression evaluation, and
image understanding (quad/oct trees). In [BCLR], Bhatt et al. showed that every
N-node binary tree could be embedded in an N-processor hypercube such that each
processor received a single tree node, and the maximum dilation was O(1). Embedding
trees into butterfly networks is harder, because the butterfly is much sparser
than the hypercube. In [BCHLR], Bhatt et al. showed how to embed the complete
binary tree with N nodes in a butterfly network with N processors with constant
dilation and load. The problem of embedding arbitrary trees into butterfly networks
was left open.

Tree  structured  computations  are  often  dynamic.  As  the  computation  progresses, 
the  tree  may  grow  or  shrink,  in  a  manner  which  may  be  impossible  to  predict  before¬ 
hand.  In  [BC],  Bhatt  and  Cai  propose  a  dynamic  version  of  the  embedding  problem. 
They  consider  a  process  graph  which  is  a  binary  tree  that  can  grow  during  execution. 
At  each  step  any  node  of  the  tree  that  does  not  have  two  children  can  request  to 
spawn  a  child.  The  dynamic  embedding  problem  is  harder  than  the  static  one  since 
newly  spawned  children  must  be  allocated  to  processors  incrementally,  without  mak¬ 
ing  assumptions  about  how  the  tree  will  grow  in  the  future.  Further,  the  placement 
decision  must  itself  be  implemented  within  the  network  in  a  distributed  manner  with¬ 
out  accessing  global  information.  The  paradigm  proposed  by  Bhatt  and  Cai  disallows 
process  migration;  i.e.  once  a  process  is  placed  on  a  particular  processor,  it  cannot  be 
moved  subsequently.  Obviously,  allowing  migration  can  potentially  give  better  load 
balancing/dilation  but  can  also  be  extremely  expensive  in  practice. 

Bhatt  and  Cai  present  ([BC])  a  randomized  algorithm  for  dynamically  growing 
trees with M vertices on an N processor binary hypercube. Each child process is
placed no farther than a distance O(log log N) from its parent. Further, with high
probability (independent of the tree shape) the algorithm only assigns O(M/N + 1)
vertices  to  each  processor.  The  congestion  of  the  embedding  was  not  determined  but 
is  probably  on  the  order  of  log  N. 

4.1.1 Summary of Results

We  consider  the  problem  of  growing  trees  on  butterfly  and  hypercube  networks.  Our 
framework  is  identical  to  that  of  Bhatt  and  Cai  ([BC]),  although  our  growth  algo¬ 
rithms  are  substantially  simpler  and  have  provably  better  performance.  We  begin  by 
describing a level-by-level strategy for embedding a binary tree in a butterfly.
Modifications to this scheme form the basis of all our embedding algorithms. The first
modification  we  introduce  is  the  use  of  random  flip  bits,  which  randomize  the  loca¬ 
tions  of  tree  nodes  within  a  level  of  the  butterfly.  Analysis  of  the  behavior  of  these 
flip  bits  is  sufficient  to  prove  our  first  result. 

Theorem  4.1.  An  arbitrary  binary  tree  T  with  M  vertices  can  be  dynamically 
grown on an N processor hypercube with dilation 1 such that with high probability
the maximum load per processor is O(M/N + log N).

Note that this is optimal to within a constant factor whenever the tree T is large
(i.e., M > N log N). For these large trees, it gives an optimal O(M/N) load as
in [BC] while improving dilation from O(log log N) to 1. Next we present another
modification  of  the  scheme  involving  level  balancing  —  in  effect,  we  stretch  certain 
paths  within  the  tree  so  that  the  number  of  tree  nodes  assigned  to  any  level  of  the 
butterfly  is  balanced.  This  modification  leads  to  our  next  result,  this  time  for  a 
butterfly. 

Theorem  4.10.  An  arbitrary  binary  tree  T  with  M  vertices  can  be  dynamically 
grown on an N processor butterfly network with dilation 2 such that with high
probability the maximum load per processor is at most O(M/N + log N).

Again, this is optimal to within a constant factor when M > N log N. This result
is  a  substantial  improvement  over  previous  work  since  not  even  good  static  embed¬ 
dings  of  arbitrary  binary  trees  were  known.  Finally,  we  take  advantage  of  an  embed¬ 
ding  of  the  butterfly  into  the  hypercube  which  embeds  entire  levels  of  the  butterfly 
to  subcubes  of  the  hypercube  in  order  to  develop  a  scheme  for  local  redistribution  of 
load  within  levels.  This  leads  to  an  embedding  algorithm  for  the  hypercube  which 
simultaneously  optimizes  maximum  load  and  dilation.  In  addition,  the  congestion  of 
the embedding is optimal if M = O(N).

Theorem 4.14. An arbitrary tree T with M vertices can be grown on an N processor
hypercube with constant dilation such that with high probability the maximum load
is O(M/N + 1) and the congestion is O(M/N + 1).

It should be noted that although our theorems are phrased in terms of trees
which only grow, these embedding algorithms are also effective for dynamic trees
which can both grow and shrink at their leaves. Consider a binary tree T which
grows and shrinks. At each stage in the tree’s evolution, the probability space of
possible embeddings of the current form of the tree T' is equivalent to the space of
embeddings which would have occurred had we simply grown the tree T' using the
same  algorithm.  Therefore  the  same  results  hold  for  each  step  in  the  tree’s  evolution 
(assuming,  of  course,  that  the  total  number  of  steps  in  the  tree’s  evolution  is  bounded 
by  a  polynomial  in  N). 

We also prove a lower bound for deterministic embedding algorithms for hypercubes
which shows that any deterministic algorithm which balances load must necessarily
have dilation Ω(√(log N)). It follows that any embedding algorithm which
simultaneously  optimizes  load  and  dilation  (to  within  constant  factors)  must  be  ran¬ 
domized.  This  consequence  also  holds  for  the  butterfly,  since  it  is  a  subgraph  of  the 
hypercube. 

Tom  Leighton,  Abhiram  Ranade  and  Eric  Schwabe  coauthored  all  the  work  ap¬ 
pearing  in  chapter  four. 

4.1.2  Overview 

The  basic  embedding  algorithm  is  presented  in  section  4.2  along  with  the  introduction 
of flip bits and the proof of theorem 4.1. The level-balancing scheme is introduced
and  analyzed  in  section  4.3,  along  with  a  proof  of  theorem  4.10.  Improvements  to  the 
hypercube  embedding  algorithm  and  proof  of  theorem  4.14  are  given  in  section  4.4. 
Section  4.5  states  and  proves  the  lower  bound  for  deterministic  algorithms. 


4.2  The  Basic  Growth  Algorithm 

4.2.1  Preliminary  Scheme 

We begin with a level-by-level strategy for growing a tree on an N-node butterfly
network. For this chapter, we set n so that N = n2^n. That is, the N-node butterfly
has n levels.

In  the  cases  where  we  are  ultimately  interested  in  an  embedding  in  a  hypercube, 
we  will  first  embed  the  tree  in  a  butterfly,  and  then  consider  some  embedding  of 
the butterfly in the hypercube. We place the root of the tree on processor 0_0 in
the  butterfly.  This  processor  is  connected  to  two  processors  in  level  1,  on  which  we 
place  the  children  of  the  root.  These  processors  are  in  turn  connected  to  4  level  2 
processors,  which  will  in  turn  receive  the  children  of  the  root’s  children,  and  so  on. 
This  strategy  enables  us  to  grow  any  n  level  binary  tree  with  dilation  1,  and  with  at 
most  one  tree  vertex  per  butterfly  processor.  Trees  with  greater  height  are  wrapped 
around;  i.e.  level  n  vertices  are  placed  in  butterfly  level  0,  and  so  on.  The  set  of  tree 
vertices which are mapped to level i of the n level butterfly consists of those vertices
in levels i, i + n, i + 2n, ... and so on; we refer to this as the i-th level set of the tree.
There  are  two  issues  we  need  to  consider: 

1.  Evenly  distributing  tree  vertices  within  the  processors  in  each  level.  We  would 
like the vertices belonging to level set i to be evenly distributed among the
processors in the i-th level of the butterfly; i.e. to guarantee that no single
processor in level i receives too many vertices.

2.  Evenly  distributing  tree  vertices  among  different  butterfly  levels.  For  example, 
when mapping a complete binary tree of height h, level (h - 1) mod n of the
butterfly  would  receive  the  leaves  of  the  tree,  or  about  half  the  total  number  of 
vertices.  Ideally,  we  would  like  the  vertices  to  be  divided  evenly  among  all  the 
levels  of  the  butterfly. 

We  will  defer  our  consideration  of  the  second  issue  until  section  4.3.  First,  a 
modification  of  the  basic  scheme  helps  us  achieve  balance  within  a  level. 
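
The preliminary level-by-level scheme can be sketched as follows. This is only an illustration under stated assumptions: the butterfly is taken to be wrapped, a processor is written as a pair (column, level), and the edge leaving level i either keeps the column or flips bit i of it. The convention that a left child takes the straight edge and a right child the cross edge, and the name `place`, are hypothetical choices made here.

```python
from itertools import product

def place(path: str, n: int):
    """Processor (column, level) of the tree vertex reached from the root
    by following `path`, a string of 'L'/'R' moves, on an n-level butterfly."""
    column, level = 0, 0              # the root sits in column 0, level 0
    for move in path:
        if move == "R":
            column ^= 1 << level      # cross edge: flip the bit indexed by the level
        level = (level + 1) % n       # taller trees wrap around to level 0
    return column, level

# Every vertex of an n-level tree gets its own processor.
n = 3
shallow = ["".join(p) for d in range(n) for p in product("LR", repeat=d)]
positions = {place(p, n) for p in shallow}
print(len(shallow), len(positions))   # equal counts mean one vertex per processor
```

A vertex at depth d lands in butterfly level d mod n, which is exactly the level-set structure described above.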

4.2.2  Flip  Bits 

A  random  flip  bit  is  generated  at  each  vertex  of  the  tree  to  decide  where  its  children 
will be spawned. Consider a vertex v of the tree that has been placed on some
processor p in level i of the butterfly. This node is connected to processors q and r in
level (i + 1) mod n, which will receive the children of v. The flip bit chosen for vertex v
decides  whether  the  left  child  of  v  will  be  placed  on  q  or  on  r.  The  right  child  is  then 
placed  on  the  other  processor.  Note  of  course  that  it  is  not  necessary  that  v  have  two 
children  -  the  bit  only  determines  where  the  children  will  be  placed  if  they  are  ever 
spawned. 

In section 4.3 we will show that this ensures even distribution within each level.
Intuitively,  each  vertex  is  effectively  placed  using  a  random  path  determined  by  the 
flip  bits  chosen  along  its  ancestors.  For  now,  this  modified  scheme  is  sufficient  to 
prove  theorem  4.1. 
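
A minimal simulation of growth with flip bits is sketched below. It assumes the same wrapped-butterfly convention as before (a processor is a pair (column, level), and the two neighbours in the next level keep the column or flip bit `level` of it); the tree representation and the name `grow_with_flip_bits` are hypothetical. The tree is given up front only for convenience; each placement uses nothing but the parent's position and its flip bit, so the same rule applies when children are spawned one at a time.

```python
import random
from collections import Counter

def grow_with_flip_bits(children, n, seed=0):
    """children: dict vertex -> (left, right), with None for a missing child;
    the root is vertex 0. Returns dict vertex -> (column, level)."""
    rng = random.Random(seed)
    position = {0: (0, 0)}
    stack = [0]
    while stack:
        v = stack.pop()
        column, level = position[v]
        flip = rng.getrandbits(1)                    # the flip bit generated at v
        targets = (column, column ^ (1 << level))    # straight and cross neighbours
        left, right = children.get(v, (None, None))
        # The flip bit picks the neighbour for the left child;
        # the right child takes the other one.
        for child, which in ((left, flip), (right, 1 - flip)):
            if child is not None:
                position[child] = (targets[which], (level + 1) % n)
                stack.append(child)
    return position

# Usage: a 64-vertex path in which every vertex has only a left child.
path_tree = {v: (v + 1, None) for v in range(63)}
position = grow_with_flip_bits(path_tree, n=3, seed=1)
column_load = Counter(column for column, _level in position.values())
print(sorted(column_load.items()))
```

Without flip bits such a path would stay in a single column; with them, each vertex follows a random path determined by the bits chosen at its ancestors.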

Theorem  4.1.  An  arbitrary  binary  tree  T  with  M  vertices  can  be  grown  dynamically 
on  an  N  processor  hypercube  with  dilation  1  such  that  with  high  probability  the 
maximum load per processor is O(M/N + log N).

Theorem  4.1  follows  directly  from  the  following  lemma. 

Lemma 4.2. An arbitrary tree T with M vertices can be grown in a butterfly network
of N processors such that each column in the butterfly receives no more than
O(M/2^n + n) vertices with high probability.

Suppose this lemma were true. Then by simulating the N = n2^n-node butterfly by
a 2^n-node hypercube, where each node of the hypercube simulates an entire column
of the butterfly, we have an embedding algorithm for the hypercube which achieves
dilation 1 and load O(M/N + log N) with high probability. Thus this lemma is
sufficient to prove theorem 4.1.
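
The column-collapsing simulation can be checked mechanically on a small instance. The sketch below assumes a wrapped butterfly in which the edges leaving level i keep the column or flip bit i of it; the helper names are hypothetical. It confirms that every butterfly edge maps either to a single hypercube node or to a hypercube edge.

```python
def butterfly_edges(n):
    """Yield the edges of an n-level wrapped butterfly with 2^n columns."""
    for level in range(n):
        nxt = (level + 1) % n
        for column in range(1 << n):
            yield (column, level), (column, nxt)                  # straight edge
            yield (column, level), (column ^ (1 << level), nxt)   # cross edge

def collapse(processor):
    """Hypercube node simulating a butterfly processor: its column."""
    column, _level = processor
    return column

def max_stretch(n):
    """Largest hypercube distance between the images of adjacent processors."""
    return max(
        bin(collapse(a) ^ collapse(b)).count("1") for a, b in butterfly_edges(n)
    )

print(max_stretch(4))
```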

The  general  idea  behind  the  proof  of  lemma  4.2  is  that  a  large  number  of  vertices 
will be placed in the same column in the butterfly only if the flip bits on the paths
leading  to  these  vertices  are  chosen  in  a  specific  (unlikely)  manner. 

A stagnant path p is a maximal path v(1), v(2), ..., v(l) in T with v(1) towards
the root such that all v(i) are placed in the same column v of the butterfly. Let the
leader of p be the n-th ancestor of v(1), and the trace of p be the set of n + l - 1 vertices
between the leader (inclusive) and v(l) (exclusive). If v(1) is in the first n levels of
the tree, then the leader of the path is defined to be the root of the tree.

Notice  that  there  is  a  unique  path  in  the  butterfly  from  the  leader  of  a  stagnant 
path p to vertex v(1). Thus, given the column in which the leader lies, and the column
in which the path p lies, we can completely determine the flip bits chosen along the
trace of the path. The next observation is that the traces of distinct stagnant paths
mapped to the same column are distinct; i.e. the information gained from one trace
is  different  from  that  obtained  in  the  other. 

Lemma 4.3. Let p and p' be two distinct stagnant paths placed in the same column
of  the  butterfly.  Then  their  traces  are  vertex  disjoint  in  the  tree. 

Proof.  Contrary  to  the  lemma,  suppose  the  lowest  point  in  the  tree  at  which  the 
traces  intersect  is  vertex  u.  At  vertex  u,  the  two  traces  are  mapped  to  the  same 
column  of  the  butterfly.  Likewise,  the  two  stagnant  paths  are  mapped  to  the  same 
column.  The  two  children  of  u  are  mapped  to  different  columns  of  the  butterfly, 
however,  and  therefore  the  traces  must  reconverge  in  some  butterfly  column  between 
the  children  of  u  and  the  beginnings  of  the  two  stagnant  paths.  However,  the  two 
paths cannot meet again in any column until they have traversed all n levels of the
butterfly.  Since  the  two  stagnant  paths  are  at  a  distance  less  than  n  from  u,  the  traces 
cannot  reconverge  in  the  butterfly  before  reaching  them,  and  we  have  a  contradiction. 


Lemma  4.4.  For  any  column  v  of  the  butterfly,  there  is  at  most  one  stagnant  path 
mapped  to  v  such  that  v(l)  is  in  the  first  n  levels  of  the  tree. 

Proof. This lemma follows immediately from lemma 4.3 by noting that any two such
paths  will  have  the  same  leader  (the  root  of  the  tree).  ■ 

Proof. (of Lemma 4.2) We shall count the number of different settings of the flip
bits that give rise to some column having at least C = k(M/2^n + n) tree vertices.
This can be done as follows:

1. Choose the column: 2^n choices.

2. Choose the number of stagnant paths: C choices.

3. Choose the endpoint of each path: \binom{M}{C_0} choices, where C_0 is the number of stagnant
paths. Define β = C/C_0.

4.  Choose  the  length  of  the  paths:  choices. 

5. Choose the flip bits at all vertices in T except those in the C_0 traces. The total
number of flip bits is M, and the length of the jth trace is n + l_j - 1, except
for the possible case when one stagnant path has v(1) in the first n levels of the
tree, in which case the length of its trace is l_j - 1. Thus the total number of
bits this step fixes is: M - Σ_j (n + l_j - 1) + n = M - (C_0(n - 1) + C) + n. Thus
the total number of choices is 2^{M - (C_0(n - 1) + C) + n}.

First  we  claim  that  the  above  choices  completely  determine  all  the  flip  bits.  To 
see  this,  consider  the  trace  with  its  leader  belonging  to  the  smallest  level  in  T,  of  all 
traces.  Clearly,  the  last  step  of  the  above  procedure  fixes  the  position  of  the  leader. 
This  fixes  all  the  bits  in  the  trace,  since  the  endpoint  and  the  length  of  the  trace  are 
known.  The  bits  for  the  other  traces  are  similarly  determined. 

The total number of ways of choosing all the bits is 2^M. Thus the probability that
some  column  gets  more  than  C  vertices  is  at  most 

^C+Coj2A#-(Co(n-l)+C) ^2^*^ 

<  2’"C 

To go from the first line to the second we have used the inequality \binom{n}{r} < (ne/r)^r.
Choosing  k  >  lOe^,  and  noting  that  0{ff  +  1)  <  5(2^^^),  we  can  simplify  the  above 
expression  to: 

<  2’’*C2"^/’ 

<  2"^'^ 

<  2"*"^^ 
Figure 4-1: Level balancing a tree, n = 6. The numerical labels indicate the stretch
counts  chosen  at  those  nodes.  White  nodes  indicate  dummy  vertices. 


4.3  Embedding  in  the  Butterfly 

In this section we introduce a modification to the embedding algorithm which ensures
that  with  high  probability  the  nodes  of  the  binary  tree  are  distributed  evenly  among 
the  levels  of  the  butterfly.  We  then  prove  that  the  flip  bits  described  in  the  previous 
section  are  sufficient  to  distribute  the  tree  nodes  evenly  within  each  level. 


4.3.1  A  Level-Balancing  Transformation 

We  transform  the  tree  T  being  grown  by  selectively  inserting  dummy  vertices  into 
some  of  its  edges  during  the  growth.  Even  if  some  level  originally  has  a  dispropor¬ 
tionately  large  number  of  vertices,  the  newly  introduced  vertices  help  to  even  the 
distribution  of  the  tree  vertices  among  the  levels. 

The n-way level balancing transformation is as follows. Define a vertex of T to
be distinguished if it lies in level i ≡ 0 (mod n/3).^1 For each distinguished vertex v
in T we pick a random number S(v) between 0 and n/3 called the stretch count. We
insert a single dummy vertex in each of the edges that connect v to its descendants
in levels i + 1 through i + S(v). Figure 4-1 illustrates the transformation. Note
that  this  transformation  can  be  applied  as  the  tree  grows.  Each  node  only  needs  to 
know  what  level  of  the  tree  T  it  belongs  to,  and  the  stretch  count  generated  at  its 
nearest  distinguished  ancestor.  This  is  sufficient  information  to  decide  whether  or  not 
a  dummy  vertex  is  inserted  when  a  child  is  spawned. 

The new tree B(T) that results is grown on the butterfly using the procedure
described in section 4.2.2. This gives a dilation 1 embedding for B(T). This corresponds
to a dilation 2 embedding of T, since some of the edges in T were replaced by
two edges in B(T).
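
The local rule each node applies when spawning a child can be sketched as follows. This is an illustration under assumptions: `third` stands for n/3 (taken to be an integer), a child's nearest distinguished ancestor may be its own parent, and the function names are hypothetical. Stretch counts are passed in explicitly here so the example is reproducible; in the algorithm they are drawn at random at each distinguished vertex.

```python
def spawn_level(parent_t_level, parent_b_level, stretch, third):
    """Level in B(T) of a child, given its parent's level in T and in B(T)
    and the stretch count of the child's nearest distinguished ancestor."""
    anchor = (parent_t_level // third) * third   # T-level of that distinguished ancestor
    child_t_level = parent_t_level + 1
    # A dummy vertex is inserted on edges leading into levels
    # anchor + 1 through anchor + stretch.
    dummy = 1 if child_t_level <= anchor + stretch else 0
    return parent_b_level + 1 + dummy

def b_levels_along_path(depth, third, stretch_at):
    """B(T) levels of the vertices on one root-to-leaf path of length `depth`.
    stretch_at(t_level) returns the stretch count chosen at that level."""
    levels = [0]
    stretch = 0
    for d in range(depth):
        if d % third == 0:                       # the vertex at level d is distinguished
            stretch = stretch_at(d)
        levels.append(spawn_level(d, levels[-1], stretch, third))
    return levels

# n = 6, so distinguished levels are 0, 2, 4; stretch counts 1, 2, 0 respectively.
print(b_levels_along_path(6, 2, {0: 1, 2: 2, 4: 0}.get))
```

Consecutive levels in the output differ by 1 or 2, matching the dilation 2 claim for T.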

4.3.2  Analysis  of  Tree  Balancing 

We  show  that  the  n-way  level  balancing  transformation  of  section  4.3.1  is  sufficient 
to  evenly  distribute  the  tree  vertices  among  the  levels  in  the  butterfly.  In  particular, 
we show that for any tree T, no level set in B(T) will contain a disproportionately
large number of vertices. Since level i of the butterfly receives vertices from the i-th
level set of B(T), this implies that tree vertices are uniformly distributed among the
butterfly levels.

Lemma 4.5. For an arbitrary tree T, the n-way level-balancing transformation gives
a tree B(T) such that the total number of vertices in the i-th level set of B(T) is at
most O(M/n + 2^n) with high probability.

We will prove the following slightly modified (but equivalent) version. Define the
i-th level set triple of a tree to be the set of vertices from level sets i, i + n/3 and
i + 2n/3. Define a partition of T into 3 zones as follows (Figure 4-2). Zone 0 consists
of vertices in levels kn through kn + n/3 - 1. Zone 1 consists of vertices in levels
kn + n/3 through kn + 2n/3 - 1. Zone 2 consists of vertices in levels kn + 2n/3
through (k + 1)n - 1. Each zone consists of a number of trees of maximum height
n/3. We will show that no level set triple of B(T) will receive more than O(M/n)
vertices from any zone of T, with high probability. Lemma 4.5 follows because there
are only 3 zones, and since the number of vertices in a level set triple upper bounds
the number of vertices in a level set.

^1 In what follows we may make references like “(mod x)” or “contribution of x messages” when
x may not be integral. Rounding these quantities to integers does not affect the correctness of the
proof. For ease of exposition, we shall not consider the issue.

Figure 4-2: Subdivision into Zones, and a forest f_j.

The key observation is that each zone can be partitioned into a set of forests
f_1, f_2, ... that contribute independently to level set triple i, for any i. We illustrate
the partitioning for zone 1. Each f_j consists of all trees from zone 1 between levels
kn + n/3 and kn + 2n/3 - 1 that have a common ancestor r_j at level kn, for some
fixed k. Other zones are partitioned similarly.

Lemma 4.6. Let X_j denote the number of zone 1 vertices from a forest f_j placed
in level set triple i of B(T). Then all variables X_j are mutually independent, and
E(X_j) = 3M_j/n, where M_j is the number of vertices in f_j.

Proof. Let variable Y_j denote the level set triple into which the roots of the trees
in f_j are placed. By definition, these roots are all placed in the level set triple given
by the level set triple of r_j plus S(r_j), mod n/3. Since the stretch counts of the r_j's
are uniformly selected from [0, n/3] and are mutually independent, it follows that the
Y_j's are also uniformly selected from [0, n/3] and are mutually independent. Since
X_j is completely determined by Y_j and the stretch counts chosen at the roots of trees
in f_j, it follows that the X_j are mutually independent, and that E(X_j) = 3M_j/n. ■

Similarly, this lemma holds for any other zone of the tree T, except for the first
section of zone 0, which contains the vertices in levels 0, ..., n/3 - 1. However, this
segment of the tree contains at most 2^{n/3} - 1 nodes, which will be mapped one-to-one
to nodes of the butterfly.

Proof. (of Lemma 4.5) The X_j are independent random variables. Clearly, no X_j
can contribute more than 2^{2n/3} vertices, since the forest is part of a tree of height no
more than 2n/3. The mean of each X_j is 3M_j/n, where M_j is the number of vertices
in f_j; therefore the mean of X (= Σ_j X_j) is at most Σ_j 3M_j/n ≤ 3M/n. We have by the
independence of the X_j that for any t,

    E[e^{tX}] = Π_j E[e^{tX_j}]
              = Π_j Σ_x Pr[X_j = x] e^{tx}.

As in lemma 2.7, the expectation is maximized when only the events [X_j = 0] and
[X_j = 2^{2n/3}] have positive probability. Suppose there were some value x, not equal to
0 or 2^{2n/3}, such that Pr[X_j = x] = δ > 0. Then by the convexity of e^{tx}, changing
Pr[X_j = x] to 0 and setting Pr[X_j = x - 1] = Pr[X_j = x + 1] = δ/2 would increase
the expectation of e^{tX_j}. It follows that in order to maximize the expectation, the two
endpoints of the interval must be the only events with positive probability. If we use
Markov’s inequality to put an upper bound on Pr[X_j = 2^{2n/3}] then

    E[e^{tX}] ≤ Π_j ((1 - 3M_j/(n2^{2n/3})) + (3M_j/(n2^{2n/3})) e^{t2^{2n/3}}).

Again using Markov’s inequality, we obtain for any constant b,

    Pr[X > 3bM/n] = Pr[e^{tX} > e^{3btM/n}]
                  ≤ E[e^{tX}] / e^{3btM/n}.

This quantity is minimized at t = ln b / 2^{2n/3}. At this value of t, and as long as
M > n^2 2^{2n/3}, this quantity is smaller than N^{-k} for some constant k which can be
made as large as desired by choosing b sufficiently large. ■
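
One way to see where this choice of t comes from is the following sketch. The shorthand B = 2^{2n/3} and \mu_j = E[X_j] is introduced here for convenience and is not notation from the text; the sketch assumes t ≥ 0 and b > 1 and uses only the two-point bound on each factor together with 1 + y ≤ e^y.

```latex
% Shorthand (assumed for this sketch): B = 2^{2n/3}, \mu_j = E[X_j], \sum_j \mu_j \le 3M/n.
\begin{align*}
E[e^{tX}] &\le \prod_j \Bigl(1 + \tfrac{\mu_j}{B}\,(e^{tB}-1)\Bigr)
           \le \exp\Bigl(\tfrac{3M}{nB}\,(e^{tB}-1)\Bigr)
           && \text{(using } 1+y \le e^{y}\text{)},\\
\Pr[X > 3bM/n] &\le \exp\Bigl(\tfrac{3M}{n}\Bigl(\tfrac{e^{tB}-1}{B} - bt\Bigr)\Bigr),\\
\frac{d}{dt}\Bigl(\tfrac{e^{tB}-1}{B} - bt\Bigr) &= e^{tB} - b = 0
           \iff t = \frac{\ln b}{B},\\
\Pr[X > 3bM/n] &\le \exp\Bigl(-\tfrac{3M}{nB}\,(b\ln b - b + 1)\Bigr)
           && \text{at } t = \ln b / B .
\end{align*}
```

Since b ln b - b + 1 grows without bound in b, the exponent can be made an arbitrarily large multiple of M/(nB), which is where the freedom to enlarge k by enlarging b comes from.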

4.3.3  Effectiveness  of  Flip  Bits 

We now show that, given the effectiveness of the level-balancing algorithm, the flip
bits  suffice  to  distribute  the  tree  nodes  within  the  levels  of  the  butterfly. 

Lemma 4.7. Let W_i denote the total number of vertices in level set i in an arbitrary
binary tree T. When T is grown on a butterfly with n levels, no processor from any
level i receives more than O(W_i/2^n + n) vertices with high probability, for all i.

In other words, whenever W_i > n2^n, each of the 2^n processors in level i will receive
roughly the same number of tree vertices.

The  key  to  the  proof  is  the  observation  that  the  vertices  placed  on  a  processor 
can  be  attributed  to  a  large  number  of  mutually  independent  sources.  To  see  this, 
partition T into subtrees T_1, T_2, ... where each subtree is rooted at some vertex in
level kn + i and consists of all the descendants of that vertex between levels
kn + i + 1 and kn + i + n (figure 4-3).

Lemma 4.8. At most one level n vertex from each subtree T_j will be placed on any
processor p on level i of the butterfly. The probability of a vertex from T_j being placed
on processor p is w_j/2^n, where w_j denotes the number of vertices in level n of tree T_j.
Further the contributions of the different subtrees to p are mutually independent.

Proof. Any tree T_j can have at most 2^n vertices at level n, and the growth algorithm
guarantees that these will be placed on distinct processors within a single level. Thus
we know that at most one vertex from a tree T_j will be placed on a given processor
p in level i of the butterfly.

It follows from the above that the number of vertices from T_j placed on p is a
random variable with value either 0 or 1. The probability that any given vertex from
level n of T_j will be placed on p is 1/2^n, so the expectation of this random variable
is w_j/2^n. Since the value of the random variable can only be 0 or 1, w_j/2^n must be
the probability that it is 1. Thus the probability of a vertex from T_j being placed on
p is w_j/2^n.

Figure 4-3: The tree T and its partition, i = 1, n = 2.

The  independence  between  different  subtrees  follows  because  the  flip  bits  in  each 
subtree  are  picked  independently.  ■ 

To complete the proof of lemma 4.7, we need the following lemma, due to Hoeffding
([H]).

Lemma 4.9. [Hoeffding] If we have L independent Bernoulli trials with respective
probabilities p_1, ..., p_L, with Lp = Σ_i p_i, and m > Lp + 1 is an integer, the probability
of at least m successes is at most B(m, L, p), where B(m, L, p) < (Lpe/m)^m.
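
The bound in lemma 4.9 can be compared numerically with an exact tail. The sketch below makes the simplifying assumption that all L trials share the same success probability p, so the exact tail is a binomial sum; the function names are hypothetical.

```python
from math import comb, e

def binomial_tail(L, p, m):
    """Exact probability of at least m successes in L independent trials
    that each succeed with probability p."""
    return sum(comb(L, k) * p**k * (1 - p) ** (L - k) for k in range(m, L + 1))

def hoeffding_style_bound(L, p, m):
    """The bound (Lpe/m)^m quoted in lemma 4.9."""
    return (L * p * e / m) ** m

# With L = 100 and p = 0.05 the mean is Lp = 5; compare the two at a few thresholds.
for m in (10, 15, 20):
    print(m, binomial_tail(100, 0.05, m), hoeffding_style_bound(100, 0.05, m))
```

The bound is loose near the mean and only becomes useful once m is a constant factor above Lpe, which is why the proofs take thresholds of the form k(n + W_i/2^n) with k a large constant.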

Proof. (of Lemma 4.7) The number of vertices placed at a processor is the sum of
independent random variables corresponding to each tree T_j. The expected number
of vertices is Σ_j w_j/2^n = W_i/2^n. The probability that a given processor receives more
than k(n + W_i/2^n) vertices is at most (using lemma 4.9):

    (e(W_i/2^n) / (k(n + W_i/2^n)))^{k(n + W_i/2^n)} ≤ (e/k)^{kn}.

Thus the probability that one of the 2^n processors in any of the n levels receives more
than k(W_i/2^n + n) vertices is at most

    n2^n (e/k)^{kn} ≤ N^{-c}

for some constant c > 0 when k is chosen sufficiently large. ■

Theorem 4.10. An arbitrary binary tree T with M vertices can be grown dynamically
on an N processor butterfly network with dilation 2 such that with high
probability the maximum load per processor is at most O(M/N + log N).

Proof. By lemma 4.5, with high probability we have W_i = O(M/n + 2^n) for all i,
and by lemma 4.7, with high probability no processor in level i will receive more
than O(n + W_i/2^n) vertices. Thus with high probability, fewer than O(log N + M/N)
vertices are mapped to any processor. ■


4.4  An  Improved  Hypercube  Embedding 

The  butterfly  can  be  embedded  in  the  hypercube  with  dilation  2  such  that  each  level 
of the butterfly is a subcube of the hypercube. Therefore we can have the hypercube
simulate any embedding algorithm for the butterfly, with a unique 2^n-node subcube
simulating each level. We will take advantage of this by using a scheme which has each
level (subcube) receiving only O(M/n + 2^n) tree nodes, and developing a method for
local distribution within these subcubes which will reduce the load on each individual
processor  while  guaranteeing  low  congestion.  We  begin  with  some  preliminaries. 

4.4.1  Embedding  the  Butterfly  and  Star  Covers 

Let G(x) be the Gray code value of the binary string x, defined by G(x_{log n} ... x_1) =
x_{log n} | x_{log n} ⊕ x_{log n - 1} | ... | x_2 ⊕ x_1. For any bit string x, G(x) and G((x + 1) mod n) differ
in exactly one bit position. For an integer i, let bin(i) be the binary representation of
i. The embedding which maps butterfly processor v_l to node G(bin(l)) | bin(v) of the
hypercube has dilation 2 and maps each level of the butterfly to a distinct 2^n-node
subcube of the hypercube. Also note that within each level l, if v and v' differ in
exactly one bit, then there is a hypercube edge between the embedded locations of
the nodes v_l and v'_l.
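
The embedding can be checked on a small instance. The sketch below assumes n is a power of two (so the Gray code is cyclic over the n levels), a wrapped butterfly whose edges from level l keep the column or flip bit l of it, and integer labels with the Gray-coded level in the high bits and the column in the low n bits. The names `gray`, `embed` and `dilation` are hypothetical.

```python
def gray(x: int) -> int:
    """Reflected Gray code: bit k of the result is x_{k+1} XOR x_k."""
    return x ^ (x >> 1)

def embed(column: int, level: int, n: int) -> int:
    """Hypercube node G(bin(level)) | bin(column) for butterfly processor (column, level)."""
    return (gray(level) << n) | column

def dilation(n: int) -> int:
    """Largest hypercube distance spanned by any butterfly edge."""
    worst = 0
    for level in range(n):
        nxt = (level + 1) % n
        for column in range(1 << n):
            for neighbour in (column, column ^ (1 << level)):
                d = bin(embed(column, level, n) ^ embed(neighbour, nxt, n)).count("1")
                worst = max(worst, d)
    return worst

print(dilation(4))
```

The level part of the label changes in exactly one bit along every edge, and the column part in at most one, which gives the dilation of 2.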

For  any  node  i  of  a  2’*-node  hypercube,  we  define  the  full  star  centered  at  x  to  be 
the  set  of  nodes  consisting  of  x  along  with  the  n  nodes  adjacent  to  x.  The  existence 
of  perfect  one-error-correcting  codes  implies  that  when  n  =  2”*  —  1  for  some  integer  m 
there  exists  a  collection  of  2”/n  -I- 1  full  stars  such  that  every  node  of  the  hypercube 
belongs  to  precisely  one  star  in  the  collection. 

Suppose n is not of this form. Consider the largest n' such that n' < n and n' is of
the form n' = 2^m - 1; then n' ≥ n/2. We can partition the hypercube into subcubes
of 2^{n'} nodes, and cover each of these with full stars. This star cover perfectly covers
the nodes of the 2^n-node hypercube. Each star in the star cover consists of a node x
and some subset of Θ(n) (in this case at least n/2) of its neighbors.
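
The construction for general n can be sketched as follows, again assuming a Hamming code is used inside each subcube. The helper accepts n' ≤ n so that the same code also handles the case where n is already of the form 2^m - 1; the names are illustrative.

```python
def largest_mersenne_at_most(n: int) -> int:
    """Largest n' <= n of the form 2^m - 1, for n >= 1."""
    m = 1
    while 2 ** (m + 1) - 1 <= n:
        m += 1
    return 2 ** m - 1


def star_center(x: int, n: int) -> int:
    """Center of the star containing x in a 2^n-node hypercube.

    The high n - n' bits select a subcube and are left untouched; the low
    n' bits are covered by a Hamming code (an assumption of this sketch).
    """
    n_prime = largest_mersenne_at_most(n)
    s = 0
    for pos in range(1, n_prime + 1):
        if (x >> (pos - 1)) & 1:
            s ^= pos
    return x if s == 0 else x ^ (1 << (s - 1))
```

With n = 5 we get n' = 3, so the 32 nodes split into 8 stars, each consisting of a center and 3 of its 5 neighbors.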

Choose a star cover for a 2^n-node hypercube, and duplicate this cover in each
subcube of the N (= n2^n)-node hypercube which corresponds to a level of the butterfly.
This collection of stars yields a star cover of the N-node hypercube; call it C.

4.4.2  Modifying  the  Embedding  Algorithm 

Our  discussion  of  the  hypercube  algorithm  has  two  parts: 

1. We describe a modified algorithm for embedding on the butterfly which, when
simulated on a hypercube, maps at most O(M/2^n + n) tree nodes to any star
in the cover C, with high probability.

2. We show how to deterministically redistribute the load within a star of the
hypercube among its nodes in such a way that each node receives O(M/N + 1)
tree nodes, the dilation remains constant and the resulting congestion is
O(M/N + 1).

We  begin  by  showing  how  to  modify  the  butterfly  embedding  algorithm  given  in 
the  previous  section  so  that  when  it  is  simulated  on  the  hypercube,  the  amount  of 
load  assigned  to  any  star  in  the  cover  C  is  balanced. 




We will modify our embedding algorithm as follows. Use the embedding algorithm
from the previous section, but where previously we placed the children of a tree node
v ∈ B(T) which was embedded in level l into level l + 1, choosing their locations by
a random flip bit, we will now place the first child of v into level l + 2, using a pair
of flip bits to determine its position within the level, and place the second child (if
it exists) at the location in that level determined by complementing both flip bits. It
is clear that this will increase dilation by a factor of two.
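
The modified placement rule can be sketched as below. This is an illustration under stated assumptions, not the thesis's implementation: the two butterfly edges leaving level l are assumed to control row bits l and l + 1 (mod n), and `place_children` is a hypothetical helper name.

```python
import random


def place_children(level: int, row: int, n: int, rng: random.Random):
    """Return the (level, row) positions of the two children of a tree node.

    The first child follows a random pair of flip bits (f1, f2) across the
    next two butterfly levels; the second child uses the complemented pair.
    """
    f1, f2 = rng.randint(0, 1), rng.randint(0, 1)
    b1, b2 = level % n, (level + 1) % n      # assumed row bits of the two edges
    target = (level + 2) % n
    first = row ^ (f1 << b1) ^ (f2 << b2)
    second = row ^ ((1 - f1) << b1) ^ ((1 - f2) << b2)
    return [(target, first), (target, second)]
```

Because the second child complements both flip bits, the two siblings always land in the same level at rows that differ in exactly two bit positions.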

Since we are embedding the level-balanced tree B(T), we know that, with high
probability, each level-set of the tree contains O(M/n + 2^n) nodes. As in lemma 4.6,
we observe that the vertices placed in a single star come from many independent
sources.

Partition B(T) into subtrees T_1, T_2, ... in such a way that the root of each subtree
is embedded at level l + 2 in the butterfly (or level l + 1 if n is odd) and each subtree
contains the descendants of its root down to the nodes embedded at level l in the
butterfly.

Lemma 4.11. Consider an arbitrary star S in C, contained in level l of the butterfly.
Then at most two vertices from each subtree can be placed on processors in S.
Furthermore, the contributions of each subtree to S are mutually independent.

Proof. Any subtree can have at most 2^{n/2} vertices placed in level l of the butterfly,
and these will necessarily be placed at distinct locations within the level. Suppose
that three vertices from the same subtree were mapped to the star S. Since the flip
bits are chosen in pairs, any pair of these vertices must be mapped to locations which
differ in an even number of bits; since they are all mapped to the same star, any pair
of them must differ in exactly two bit positions. Consider the paths to each of these
three vertices from their lowest common ancestor; call this vertex x. Clearly, two of
the vertices must be descendants of one child of x, and one must be a descendant of
the other. The vertex (call it y) which is the lone descendant of one of the children
of x now differs from both of the other two vertices in two bit positions which are
not corrected elsewhere in the tree. However, at some point the paths of the other
two vertices diverge (since they are placed on different processors in level l), and y's
path cannot duplicate the flip bits on both paths simultaneously. Therefore y differs
from one of the other two vertices in at least four bit positions, contradicting the
supposition that all three vertices were in the same star in level l. Therefore at most
two vertices from the same subtree can be placed in the star S.

The independence between different subtrees follows from the fact that the flip
bits are picked independently in each subtree. ■

Lemma 4.12. We can embed an arbitrary binary tree T with M nodes into an N-node
hypercube such that, with high probability, no star in the cover C receives more
than O(M/2^n + n) tree nodes.

Proof. Consider an arbitrary star S in level l from the cover C. Let X_i be the
number of tree nodes from subtree T_i which are assigned to processors in S. The X_i
are independent random variables, each with maximum value 2 (from lemma 4.11) and
mean Θ(m_i n/2^n), where m_i is the number of leaves of the subtree T_i. It follows that
the mean of X = Σ_i X_i is Θ(mn/2^n), where m is the number of tree nodes embedded
into level l of the butterfly. But since we are balancing levels by embedding the tree
B(T), we have m = O(M/n + 2^n), so that the mean of X is less than c(M/2^n + n)
for some constant c > 0. By the same argument as in the proof of lemma 4.5, we
can bound the expectation of the random variable e^{tX} by

E[e^{tX}] ≤ exp((c/2)(M/2^n + n)(e^{2t} - 1)).

Again as in lemma 4.5, we obtain for any constant b,

Pr[X > bc(M/2^n + n)] = Pr[e^{tX} > e^{tbc(M/2^n + n)}]
                      ≤ exp(c(M/2^n + n)((e^{2t} - 1)/2 - tb)).

This value is minimized at t = (ln b)/2, at which point this quantity is smaller than
N^{-k} for k which can be made as large as desired by choosing b sufficiently large. ■
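
The choice of t can be checked numerically. The sketch below assumes the moment bound has the form E[e^{tX}] ≤ exp((μ/2)(e^{2t} - 1)) for independent variables bounded by 2 with total mean at most μ, and that the threshold is bμ; under that assumption the tail bound is exp(μ · g(t)) with g(t) = (e^{2t} - 1)/2 - bt. The function names are illustrative.

```python
import math


def exponent(t: float, b: float) -> float:
    """Per-unit-mean exponent g(t) = (e^{2t} - 1)/2 - b*t of the tail bound."""
    return (math.exp(2 * t) - 1) / 2 - b * t


def best_t(b: float) -> float:
    """Minimizer of g, obtained by setting the derivative e^{2t} - b to zero."""
    return math.log(b) / 2
```

At the minimizer the exponent equals (b - 1 - b ln b)/2, which is negative for b > 1 and decreases without bound as b grows. Multiplied by a mean bound of order M/2^n + n, this is what allows the failure probability to be driven below any fixed inverse polynomial in N.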

4.4.3 Redistributing Load Within Stars

With high probability, each star in the cover has at most O(M/2^n + n) tree nodes
assigned to its O(n) nodes. We would like to redistribute the O(M/2^n + n) load
on each star evenly among the O(n) nodes of the star, using the hypercube edges
connecting butterfly nodes within the same level, so that two conditions hold:

1. Each node gets at most O(M/N + 1) load.

2. We can choose paths of constant length between the redistributed locations of
adjacent tree nodes so that the congestion on any hypercube edge is at most
O(M/N + 1).

If these two conditions can be achieved by a redistribution scheme which runs
dynamically as the tree is embedded then, with high probability, the embedding
algorithm achieves load O(M/N + 1), dilation O(1), and congestion O(M/N + 1),
simultaneously optimizing load and dilation to within constant factors. In addition,
the congestion will be optimal if M = O(N).

Place an O(M/N + 1) upper limit (with appropriate choice of constant depending
on the constant in lemma 4.12 and the number of elements in each star) on the number
of tree nodes which can be assigned to a single node. All additional load is sent to
some other node in the star which has room. It is clear that we have sufficient capacity
over each star to handle the load, and that we will still have constant dilation. In
addition we will have maximum load O(M/N + 1) at each node of the hypercube. Note
that this is not allowing process migration: each tree node is redistributed before it is
embedded into the hypercube. Once the node's redistributed location is determined,
it is embedded there permanently.
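
The capping rule can be sketched as a simple online procedure. The text does not say how the node "which has room" is chosen, so the first-fit choice below is an assumption of this sketch, as are the function and parameter names.

```python
def redistribute(arrivals, star_nodes, cap):
    """Assign each arriving tree node to a node of one star, respecting `cap`.

    arrivals:   one entry per tree node, giving the star node chosen by the
                embedding before redistribution, in order of arrival.
    star_nodes: the hypercube nodes that make up the star.
    Returns the final location of each tree node, in arrival order.
    """
    load = {v: 0 for v in star_nodes}
    placed = []
    for home in arrivals:
        if load[home] < cap:
            dest = home
        else:
            # First-fit (assumed): any node with spare capacity will do.
            # Raises StopIteration if cap is too small for the star's load.
            dest = next(v for v in star_nodes if load[v] < cap)
        load[dest] += 1
        placed.append(dest)
    return placed
```

Each tree node is placed exactly once, at the moment it arrives, which matches the remark that no process migration takes place.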

Suppose we redistribute one tree node from node v^i to node v^j in the star centered
at v in the hypercube, where v^i denotes the neighbor of v across dimension i (load
coming from or going to the center is redistributed directly). This load is passed
along the path v^i → v^{ij} → v^j rather than through the center of the star.

Lemma 4.13. If all load being redistributed among points of stars follows paths
of the form v^i → v^{ij} → v^j rather than paths through the centers of stars, then the
resulting congestion due to this redistribution is O(M/N + 1).

Proof. For each star in the cover, consider the corresponding extended star, which
consists of the star centered at v, plus all vertices v^{ij} such that both v^i and v^j are in
the star. The edges in the extended star consist precisely of those paths along which
load can be redistributed in the star centered at v. The redistribution within that star
can add at most congestion O(M/N + 1) to any of the edges in the extended star. All
that remains is to observe that any edge in the hypercube is in at most two extended
stars. Thus the total congestion it receives from redistribution is O(M/N + 1). ■
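
The two-edge detour used for redistribution is easy to state in code. In this sketch a hypercube node is an integer, dimension i corresponds to bit i, and `redistribution_path` is an illustrative name.

```python
def redistribution_path(center: int, i: int, j: int):
    """Nodes visited when moving load from v^i to v^j, avoiding the center v."""
    if i == j:
        raise ValueError("source and destination points must differ")
    v_i = center ^ (1 << i)       # starting point of the star
    v_ij = v_i ^ (1 << j)         # intermediate node, outside the star
    v_j = center ^ (1 << j)       # destination point of the star
    return [v_i, v_ij, v_j]
```

Consecutive nodes on the path differ in one bit, so the path uses two hypercube edges, and the center never appears on it.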

It remains to choose the paths between the redistributed locations of adjacent tree
nodes. Let u be a tree node and v a child of u. Let l be the level of the butterfly to
which u is mapped; then v is mapped to level l + 2. Furthermore, their positions
within their respective butterfly levels differ in at most two bit positions (before
redistribution). We consider here the case where u and v are both initially mapped
and redistributed to some point of a star rather than the center. When one or both
of them is mapped to the center of a star, the argument is even simpler.

Let x and y be the centers of the stars to which u and v, respectively, are mapped.
Let p and q be the dimensions within the star to which u is mapped and redistributed,
respectively (so that u is mapped to x^p and redistributed to x^q), and likewise r and s
for v. Let f_1, f_2 be the flip bits selected when v is embedded as a child of u. We then
define the path from u, which is redistributed to x^q in level l, to v, which is
redistributed to y^s in level l + 2, as follows (this procedure is illustrated in
figure 4-4):

1. Move from level l to level l + 1 to level l + 2 along the edges determined by the
flip bits f_1, f_2.

2. Flip the bits in positions p, then q, in effect undoing the redistribution of u
which was performed in level l. We are now at y^r, the original location of v
before redistribution.

3. Flip the bits in positions s, then r. This takes us to y^s, the redistributed location
of v in its star in level l + 2.

Figure 4-4: The path chosen between redistributed node locations. The dashed lines
indicate the path determined by the flip bits, before redistribution. The first pair
of directed edges also show this choice of flip bits. The second pair undoes the
redistribution at level l. The last pair balances the load at level l + 2.
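
The three steps can be traced on (level, row) coordinates. The sketch below assumes that the two butterfly edges leaving level l control row bits l and l + 1 (mod n) and that dimension d corresponds to row bit d; `path_between` and its parameters are illustrative names.

```python
def path_between(x, p, q, r, s, level, f1, f2, n):
    """(level, row) positions visited from x^q in level l to y^s in level l + 2.

    x is the center of u's star; u was mapped to x^p and redistributed to x^q.
    v was mapped to y^r and is redistributed to y^s.
    """
    b1, b2 = level % n, (level + 1) % n
    row = x ^ (1 << q)                         # redistributed location of u
    path = [(level, row)]
    # Step 1: follow the two butterfly edges selected by the flip bits.
    row ^= f1 << b1
    path.append(((level + 1) % n, row))
    row ^= f2 << b2
    path.append(((level + 2) % n, row))
    # Step 2: flip p, then q (undo u's redistribution).
    # Step 3: flip s, then r (apply v's redistribution).
    for dim in (p, q, s, r):
        row ^= 1 << dim
        path.append(((level + 2) % n, row))
    return path
```

With x = 0, p = 3, q = 5, r = 6, s = 7, level 0 and flip bits (1, 0) in an 8-level butterfly, the fifth position on the path is row 9 (the original location y^r of v) and the last is row 201 (y^s, with y = 73).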

In order to show that the congestion is O(M/N + 1) in this case, it suffices to show
two things. First, that the congestion along each edge of the butterfly is O(M/N + 1).
Second, that the congestion along each hypercube edge connecting nodes within a
butterfly level is O(M/N + 1). From these two facts it follows that the total congestion
is O(M/N + 1).

Consider an arbitrary butterfly edge. There are at most two nodes of the butterfly
which, when choosing the paths to their descendants, can use that edge. Since after
redistribution each of these nodes has load O(M/N + 1), the congestion along the
edge being considered can also be at most O(M/N + 1).

The  congestion  on  hypercube  edges  connecting  butterfly  nodes  within  a  level  has 
two  sources:  (1)  the  redistribution  of  the  nodes  embedded  to  that  level,  and  (2) 
undoing  the  redistribution  of  the  parents  of  the  nodes  embedded  to  that  level. 

It follows directly from lemma 4.13 that the total congestion from the first source
does not exceed O(M/N + 1). We can break up the congestion derived from the
second source into four sets, according to the flip bits chosen along the paths from
the parents of the nodes in the level we are considering. Each fixed setting of flip
bits determines a bijective map of the nodes, and therefore the jump edges, from
two levels above to the current level. The congestion on any edge from undoing
the redistribution of parents equals the congestion on its preimage from the original
redistribution. The congestion in each set is therefore O(M/N + 1), and so the total
congestion derived from undoing the redistributions is also O(M/N + 1). It follows
that the entire congestion on any edge is O(M/N + 1).

Theorem 4.14. An arbitrary tree T with M vertices can be grown on an N-processor
hypercube with constant dilation such that with high probability the maximum load
is O(M/N + 1) and the congestion is O(M/N + 1).

4.5  A  Lower  Bound  for  Deterministic  Algorithms 

In this section, we prove that any deterministic algorithm for dynamically embedding
an M-node tree in an N-node hypercube (M > N) which maintains maximum load aM/N
must have not only maximum but average dilation Ω(√(log N)/a^2). It follows
that any deterministic embedding algorithm which achieves O(M/N + 1) load must
necessarily result in embeddings with dilation Ω(√(log N)) for some trees. Thus any
embedding algorithm which simultaneously optimizes maximum load and dilation (to
within constant factors) must be randomized.

Theorem 4.15. Any deterministic algorithm for dynamically embedding trees in an
N-node hypercube which achieves load aM/N for a tree with M (> N) nodes must
have average edge length Ω(√(log N)/a^2).

Proof. Let aM/N be the load maintained by the embedding algorithm when embedding
an M-node tree. Define the size of a node in the hypercube to be the number
of 1's in the n-bit string associated with the node. Partition the hypercube into 6a
blocks, each block corresponding to some range of node sizes and containing N/6a
nodes. Since there are at most O(N/√(log N)) nodes of any size, each block must
contain at least Ω(√(log N)/a) sizes. This means that any two nodes which are in
non-adjacent blocks are at distance Ω(√(log N)/a) from each other.

Choose an arbitrary M > N, and grow a path of M/2 nodes, starting at the root.
At this point, some block must contain at least M/12a tree nodes; choose such a block.
We will continue growing the tree from M/12a of the nodes in the chosen block. Grow
paths from each of these tree nodes simultaneously, stopping each path's growth when it
reaches a hypercube node which is neither in the chosen block nor in a block adjacent
to it. The total number of nodes in the chosen block and adjacent blocks is at most
N/2a; since the algorithm maintains load aM/N, this set of nodes contains at most
(aM/N)(N/2a) = M/2 tree nodes. It follows that the total length of the M/12a
paths grown is at most M/2. This verifies that the tree being considered has at most
M nodes.

Now we can calculate the average edge length. Since each of the M/12a paths
connects a node in the chosen block to a node in some non-adjacent block, the total
edge length in these paths is at least (M/12a) × Ω(√(log N)/a) = Ω(M√(log N)/a^2).
Since the entire tree contains at most M edges, it follows that the average edge length
of the embedding is Ω(√(log N)/a^2). ■
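
The counting step behind the block argument can be illustrated numerically. The sketch below assumes N = 2^n (so n = log N) and uses the fact that the most common node size is n/2, shared by C(n, ⌊n/2⌋) nodes; the function name is illustrative.

```python
from math import comb, sqrt


def sizes_per_block_lower_bound(n: int, a: int) -> float:
    """Lower bound on the number of distinct node sizes in a block.

    A block holds N/(6a) nodes and no single size accounts for more than
    C(n, n // 2) of them, so the block must span at least the quotient.
    """
    N = 2 ** n
    largest_class = comb(n, n // 2)
    return (N / (6 * a)) / largest_class
```

Since C(n, ⌊n/2⌋) is at most 2^n/√n, the returned value is at least √n/(6a), which is the Ω(√(log N)/a) spread of sizes used in the proof.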


4.6  Remarks 

The embedding in section 4.4 achieves dilation at most 12. One edge of T corresponds
to at most two edges of B(T), each of which corresponds to two butterfly edges. In
the embedding of the butterfly into the hypercube each butterfly edge corresponds to
two edges of the hypercube. The redistribution algorithm adds at most four edges to
the resulting path for a total of 12 hypercube edges. By combining the techniques of
section 4.4 with those of section 4.2, we can reduce this to 6 or 7 with no increase in
load or congestion.

It is also likely that we can improve the bound on congestion to O(M/(N log N) + 1)
for hypercube embeddings by combining the techniques in section 4.4 with those of
section 4.2. We suspect that this bound is tight for all on-line algorithms, but we
can prove a bound of Ω(M/(N log N) + 1) only for deterministic on-line algorithms.
Any M-node binary tree can be embedded off-line in an N-node hypercube with load
O(M/N + 1) and constant dilation and congestion.

Although we have not worked out the details, we suspect that our embedding
algorithms also work for trees that can shrink from the top as well as grow and shrink
from the bottom, and that they can be made to work for arbitrary trees of small
degree. We also expect that our techniques will prove useful for finding embeddings
in other networks, such as the shuffle-exchange graph.